wrong tool

You are finite. Zathras is finite. This is wrong tool.


nutanixist 22: the impact of system consistency in ahv vs esx and other systems

September 27, 2025 by kostadis roussos

AHV prioritizes system consistency over workloads, whereas ESX and every other OS prioritizes workloads over system consistency.

Photo by Maria Teneva on Unsplash


If you examine the fundamental difference between AHV and ESX, once you set aside the features, APIs, and opinions, the most basic question is: “When is the host down?”

ESX asserts that as long as the kernel is running, the host remains up because a workload may be running or about to start. Even if the kernel is unreachable from the outside, ESX continues to run. The only person who can decide the host is down, therefore, is a human.

AHV, on the other hand, believes that once it is no longer part of the quorum, the host is down.
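The quorum rule can be sketched in a few lines (a hypothetical illustration, not Nutanix code): a host counts the peers it can reach and fences itself the moment it loses a strict majority.

```python
# Hypothetical sketch (not Nutanix code): a host that declares itself
# down the moment it can no longer see a majority of its peers.

def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """A host is 'in' only while it can reach a strict majority,
    counting itself as one member."""
    return (reachable_peers + 1) > cluster_size // 2

class Host:
    def __init__(self, cluster_size: int):
        self.cluster_size = cluster_size
        self.fenced = False

    def heartbeat_round(self, reachable_peers: int) -> None:
        # AHV-style rule: losing quorum means the host is down, full stop.
        # No human is consulted; workloads stop until the cluster
        # control plane reschedules them.
        if not has_quorum(reachable_peers, self.cluster_size):
            self.fenced = True

host = Host(cluster_size=5)
host.heartbeat_round(reachable_peers=3)   # 4 of 5 visible: still up
assert not host.fenced
host.heartbeat_round(reachable_peers=1)   # 2 of 5 visible: minority side
assert host.fenced
```

Note the contrast with the ESX model above: here the decision is mechanical, so every other layer can rely on it.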

Both approaches have value, but they yield different outcomes.

With ESX, the human has complete control over deciding when to restart a host. Because only the human knows whether the host is running, each additional component must continue functioning until instructed otherwise and must keep operating even if other system parts are down.

It’s why, for example, with vSphere HA, even if the network is partitioned, all hosts will run workloads.

Until the human indicates that ESX is down, all system components should assume a workload is either running on the ESX host or may start running, so they must try to keep running as well.

The difficulty is that a malfunctioning piece of software can appear just like one that is very slow.

Therefore, each layer advances without knowing if another will do the same later, which can result in incorrect decisions.
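The indistinguishability is easy to demonstrate (an illustrative sketch; `probe` and the component names are invented): a timeout-based health check returns the same verdict for a crashed component and a merely slow one.

```python
# Illustrative sketch: from the outside, a timeout-based health check
# cannot tell a crashed component from a slow one.

import threading
import time

def probe(component, timeout=0.1):
    """Report 'up' or 'suspect' -- never 'down', because from the
    outside a dead component and a slow one look identical."""
    done = threading.Event()
    worker = threading.Thread(
        target=lambda: (component(), done.set()), daemon=True)
    worker.start()
    worker.join(timeout)
    return "up" if done.is_set() else "suspect"

def healthy():
    return "ok"

def very_slow():
    time.sleep(0.5)          # will eventually answer, just late
    return "ok"

def crashed():
    time.sleep(10_000)       # will never answer in practice

assert probe(healthy) == "up"
# The prober gets the same answer for slow and dead:
assert probe(very_slow) == "suspect"
assert probe(crashed) == "suspect"
```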

A trivial way to prove this is with backup and restore. When you restore a system from a backup, to an outside observer that’s indistinguishable from a very slow system. The restored, out-of-date system must now catch up with the current state of the world. To do that, it has to be able to read the current state, but there is no precise current state. So at some point, a human must be involved to resolve inconsistencies. It’s why restoring VCF is so painful.

The benefit of this approach is that it allows surprisingly fast delivery of components, as long as integration and consistency are less critical than the speed of feature delivery for each one.

The downside is that when two systems need to agree on the system’s status, they cannot. Because only a human knows if the system is up, down, or slow, any software trying to coordinate between two components can only make an educated guess about what’s happening.

To handle this, you need to invest in more tools, monitoring, and observability. But it’s always a guess.

The alternative approach taken by AHV has the property that software systems know when the host is down, because the computer makes that decision on its own, independently of whether workloads are running on the AHV host.

More importantly, any workload on that AHV host will not run until the cluster control plane restarts it on a host.

As a result, any layered system knows exactly when to stop.

Consequently, all layered systems are aware of each other’s state.

And all parts of the system agree on the state of the workloads.

The upside of this approach is that the system is correct, scales better, and is simpler to operate and develop against. The downside is that until the quorum system is more reliable than a single kernel, your system is less reliable.

What AHV has done is make its clustered system as reliable as a single kernel. And that is an astonishing achievement.

Once that is achieved, and if overall system behavior is more important than any one system, then the simplicity of the AHV approach allows for faster feature delivery, because integration becomes simple.


Filed Under: ahv, nutanixist

the nutanixist 13: x86 virtualization may not be what you think it is, bare metal is roaring back and why you need a different platform like AHV

July 24, 2025 by kostadis roussos

I wrote this a while ago, and since then, I have learned a great deal more about what makes AHV special. And although I talk about the database here, it’s not just about the database; it’s also about the kind of OS and the availability models of that system.

Photo by Jonny Gios on Unsplash

One of the more enduring mysteries about x86 virtualization is how profoundly misunderstood it is by those who are distant from it, considering its widespread use.

Folks with a passing understanding assume that something sits between the actual processor and the workload, intercepting every instruction and translating it on the fly.

Except that hasn’t been the case since more or less the beginning.

What VMware and most other vendors did was virtualize a processor’s control instructions, not the workload instructions.

A processor has a set of instructions and capabilities for running workloads and a set of capabilities and instructions for managing the hardware.

The OS-processor interface is peculiar and continuously evolving. By its very nature, it was initially engineered to assume that only one OS ever interacted with it.

VMware virtualized that OS-Processor interface, enabling multiple different OSs to run on the same x86 hardware.

Once the processor’s control plane was virtualized, it became possible to build an OS (ESXi) that treated VMs as first-class abstractions.

ESXi enabled far more sophisticated control and sharing of the physical resources. It could do that because the control plane was virtualized, and when it needed to interfere with the running guest, it was able to.

Nowadays, every OS does the same thing—it virtualizes the OS-processor interface and, using that abstraction, can run multiple VMs on a single processor.
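As a toy model of this (purely illustrative; real VMX/SVM hardware is far more involved), imagine a hypervisor that lets guest workload instructions execute directly but traps control instructions and emulates them against per-guest virtual state:

```python
# Toy model of trap-and-emulate (illustrative only, not real VMX/SVM):
# guest 'workload' instructions execute directly; 'control' instructions
# trap to the hypervisor, which emulates them against per-guest state.

class Guest:
    def __init__(self, name):
        self.name = name
        self.acc = 0            # workload state: runs at full speed
        self.virt_cr3 = None    # virtualized control state (shadowed)

class Hypervisor:
    PRIVILEGED = {"load_page_table"}

    def __init__(self):
        self.traps = 0

    def run(self, guest, program):
        for op, arg in program:
            if op in self.PRIVILEGED:
                self.traps += 1          # control-plane op: trap
                self.emulate(guest, op, arg)
            elif op == "add":            # workload op: direct execution
                guest.acc += arg

    def emulate(self, guest, op, arg):
        if op == "load_page_table":
            guest.virt_cr3 = arg         # shadow the control register

hv = Hypervisor()
g = Guest("linux")
hv.run(g, [("add", 2), ("load_page_table", 0x1000), ("add", 3)])
assert g.acc == 5 and g.virt_cr3 == 0x1000 and hv.traps == 1
```

Only one of the three instructions trapped; the workload instructions never paid the virtualization tax, which is the point of virtualizing the control plane rather than the workload.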

Unfortunately, we take this for granted; it is an astonishing technical result, and we are not nearly impressed enough by it.

And given that every OS on the planet, including many free ones, can do this, and that customers who wanted to run mixed workloads were choosing an inferior form of virtualization, the cognitive dissonance between “I like bare metal” and “virtualization isn’t good enough” hurt my head.

So I dug into it.

What customers are saying is that a modern application’s control plane is a distributed system. They want a distributed infrastructure control plane on which multiple applications can rely. Virtualization does provide a mechanism for sharing a server, but that’s not useful without a distributed infrastructure control plane that applications can share.

The industry-leading virtualization does not have a clustered control plane. So, customers naturally look towards bare metal Kubernetes (K8s) because it has a distributed database.

And then the same customers use kube-virt to create VMs. The shift to bare metal is thus not about virtualization, but about what control plane virtualization you need. Today’s applications require the infrastructure control plane to be virtualized.

The next generation of infrastructure management will depend on vendors who figure out how to virtualize the interface from the K8s API server to the underlying infrastructure itself.

To achieve this, you need a distributed database that is more reliable than etcd. Why? Because etcd is what you don’t have to pay for.

Fortunately, Nutanix has one of those.


Filed Under: ahv, nutanixist

the nutanixist 03: AHV is more reliable for a broad range of workloads than alternative systems.

July 12, 2025 by kostadis roussos


Let me clarify upfront: ESXi is a fantastic operating system. Its availability is exceptional. The team that maintains it is outstanding. The management leadership that ensures quality is superb.

However, the argument that it is more available is very narrow and insists that one definition of availability is the only one that matters.

Consider a single VM. If the VM relies solely on external storage, then ESXi’s local and stateful control plane guarantees the VM will keep running as long as the server has power and the storage functions properly.

What AHV offers, compared to systems like OpenShift, vSphere, and others, is the guarantee that the clustered control plane remains available to all hosts within a quorum.

This is a fundamentally different and compelling guarantee, and it is critical for the correct operation of a workload.

Consider any modern workload that depends on the infrastructure control plane, such as any Kubernetes workload. If the Kubernetes system cannot allocate a persistent volume because the infrastructure control plane is down, then the workload is impacted.

Or consider a scenario where a set of hosts gets partitioned. If the workload must run within the same partition, systems lacking a clustered control plane cannot ensure they stay in the same partition. Thus, the VMs might be running, but the workload itself isn’t.

A clustered system guarantees that the VMs and the workload run within a single partition.
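As a sketch of what that guarantee means (names and functions are invented for illustration, not Nutanix APIs), a clustered scheduler can refuse to place a workload at all unless a strict-majority partition exists, and then place every VM inside it:

```python
# Hedged sketch: a clustered scheduler that only places a multi-VM
# workload inside the majority partition, so the VMs and the workload
# stay together. Names are illustrative, not Nutanix APIs.

def majority_partition(partitions, cluster_size):
    """Return the partition holding a strict majority of hosts, if any."""
    for hosts in partitions:
        if len(hosts) > cluster_size // 2:
            return hosts
    return None     # no quorum anywhere: nothing is allowed to run

def place(vms, partitions, cluster_size):
    quorum = majority_partition(partitions, cluster_size)
    if quorum is None:
        return {}                       # refuse to split the workload
    hosts = sorted(quorum)
    # Round-robin the VMs across the quorum side only.
    return {vm: hosts[i % len(hosts)] for i, vm in enumerate(vms)}

# A 5-host cluster splits 3/2; every VM lands on the 3-host side.
plan = place(["db", "app", "web"], [{"h1", "h2", "h3"}, {"h4", "h5"}], 5)
assert set(plan.values()) <= {"h1", "h2", "h3"}
```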

Similarly, any workload requiring a clustered service as part of the infrastructure, like HCI or SDN, depends on a control plane external to the local OS. If that control plane becomes unavailable, the workload is also unavailable.

Additionally, running a workload effectively involves maintaining system balance. If the OS on the server is running but the load balancer is not, then the system will run in an unbalanced state until the load balancer goes online. During that time, performance will be impacted. And if performance is affected, then the workload’s availability will be impacted.

What is clear is that a workload depends on the local host and the cluster control plane being operational.

In all these cases, the AHV guarantee of the availability of the clustered control plane offers clear advantages:

  • Its control plane for Kubernetes can tolerate a single host failure and continue running.
  • Its control plane supports non-disruptive upgrades.
  • AHV will only start VMs within an active partition.
  • The storage cluster’s availability is measured at 99.999% uptime, and it is fully autonomous; vSAN, on the other hand, requires vCenter for critical functions like upgrades.
  • AHV’s load balancer remains operational as long as a quorum of hosts exists.

While at VMware, I tried hard to fix this in vSphere. I initiated a series of projects to address this critical functional gap, and they were consistently deprioritized.

A funny story. I had a 1:1 with Hock Tan. It was a fun meeting. He asked me what I was working on, and I replied, “Well, I am working on making ESXi clustered, but it got canceled.” He was about to explain to me that it got canceled because of VMware’s inability to set priorities.

And of course, I couldn’t resist and said – “Well, actually no. What happened was that you bought the company, and so we decided to use that time to pivot to subscription revenue.”

Hock looked at me in a way that I interpreted as “Well, I was about to give you this lecture, and you deprived me of it.”

He recovered, however. There’s a reason why he’s who he is, and he said, “We’ll get to it after the acquisition closes.”

I was hopeful. Unfortunately, things didn’t quite work out for me and the project.

My critique of single-node OSes for clustered systems is a long-standing one.


Filed Under: ahv, nutanixist

the nutanixist 02: the problems with stateful hypervisors

July 6, 2025 by kostadis roussos



I recently wrote about how AHV is deeply misunderstood. What struck me is that ESXi and OpenShift are deeply misunderstood as well.

ESXi is amazing software with a dedicated team and satisfied customers. Its architecture enables continuous operation of VMs even if hosts become disconnected, thanks to a local control plane on each host that manages registered VMs independently, relying solely on storage. However, this control plane maintains state, so if it fails or becomes unavailable, ESXi becomes inaccessible even though VMs continue running. The control plane relies on user-space services, such as hostd and vpxa, which can fail for various reasons. This architectural design, while effective, has inherent limitations. Notably, this architecture serves as the foundation for every other commercial hypervisor, except AHV. OpenShift faces similar issues, due to KubeVirt and kubectl.

I will now focus on the architecture rather than the products.

When you build a clustered system, you are creating a clustered control plane. And you have two design choices. One is to build on top of the hosts’ local control planes, and the other is to build directly on the data plane.

If you build on the local control planes, you have two challenges. The first is that you must detect any actions taken by the local control plane and reconcile them. The second is that there are plenty of scenarios where the VM is running but the local control plane is down and can’t be recovered. And so, it becomes very tricky to determine whether a host is up or down, because, as the cluster control plane, you don’t know whether the VM is running or not.

A clustered control plane is almost always running in a split-brain mode, where it hopes that it knows just enough of the local state to make the right decisions, and it expects the local control plane won’t make decisions that break it. Not being able to determine deterministically whether a host is up or down makes the system fragile. Why? Because the host, while “seemingly” down, can be up. And while it’s in this disconnected mode, the host can be modified. At the same time, the cluster state can also change. When the host rejoins the cluster, a human must reconcile a state that cannot be reconciled automatically. Although this happens infrequently, it can’t be ruled out, so all this complexity exists solely to handle issues that arise from the basic guarantee stateful hypervisors make.
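A minimal sketch of why the rejoin is undecidable (hypothetical code, not any product’s logic): if both the host’s view and the cluster’s view of a VM changed during the partition, no automatic merge rule can pick a winner.

```python
# Illustrative sketch of the reconciliation problem: during a partition
# both the 'down' host and the cluster keep mutating state, and on
# rejoin no automatic rule can pick a winner.

def reconcile(host_view: dict, cluster_view: dict):
    """Merge per-VM state; return (merged, conflicts needing a human)."""
    merged, conflicts = {}, []
    for vm in host_view.keys() | cluster_view.keys():
        h, c = host_view.get(vm), cluster_view.get(vm)
        if h == c or c is None:
            merged[vm] = h                 # agreement, or host-only change
        elif h is None:
            merged[vm] = c                 # cluster-only change
        else:
            conflicts.append(vm)           # both sides changed: undecidable
    return merged, conflicts

# The host kept running vm1 and resized it; the cluster, believing the
# host dead, restarted vm1 elsewhere with the old size.
host_view    = {"vm1": {"cpus": 8, "host": "esx-1"}}
cluster_view = {"vm1": {"cpus": 4, "host": "esx-2"}}
merged, conflicts = reconcile(host_view, cluster_view)
assert conflicts == ["vm1"]               # a human has to sort this out
```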

So, are clustered control planes good, and are stateful ones bad? No. That’s simplistic. However, if your system relies on a clustered service, such as HCI, or if the applications are clustered, or if you have modern workloads that require interaction with the underlying control plane to operate, a clustered control plane is necessary. And if it is essential, then the choices for building it and the implications of those choices matter.


Filed Under: ahv, nutanixist

the nutanixist 01: the deeply misunderstood AHV

July 3, 2025 by kostadis roussos


Eleven years ago, when Nutanix announced AHV, my initial reaction was: This will fail.

The idea that someone could successfully introduce a new commercial hypervisor into the market seemed ridiculous.

But 11 years later, it has proven me wrong.

Despite this, AHV remains deeply misunderstood because of its uniqueness.

Most operating systems are stateful, meaning that the system’s state is stored within the OS, and when the server restarts, the OS retrieves its state from disk.

AHV, however, is stateless, and so?

Consider a VM. For example, with ESX, you can create a VM through the ESX console, and if ESX crashes, the VM’s state is saved on local disks that ESX reads to restart it.

Much of vCenter’s job is to monitor what’s happening in ESXi and respond to an environment that doesn’t match its expectations of what ESXi was doing.

With AHV, creating a VM requires accessing the cluster control plane, which runs on the CVM—a special VM that manages the cluster’s state. For more details, see The Nutanix Bible.

Thus, when the OS boots, the cluster control plane determines its state.

Why is this so powerful?

During a reboot, the control plane doesn’t need to determine what AHV considers to be running, nor does it need to stop or start processes.

More importantly, the AHV state can’t be out of sync with what the control plane believes.
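The contrast can be sketched as follows (hypothetical classes, not actual AHV or ESXi code): a stateless host derives its running set from the cluster control plane at boot, so its view can never diverge from the control plane’s.

```python
# Hypothetical sketch (not actual AHV code): a stateless host reads
# nothing from local disk at boot; it asks the cluster control plane,
# so the two can never disagree about what should be running.

class ClusterControlPlane:
    def __init__(self):
        self.desired = {}          # vm -> host, the single source of truth

    def assign(self, vm, host):
        self.desired[vm] = host

    def vms_for(self, host):
        return {vm for vm, h in self.desired.items() if h == host}

class StatelessHost:
    def __init__(self, name, control_plane):
        self.name, self.cp = name, control_plane
        self.running = set()

    def boot(self):
        # Nothing read from local disk: the control plane *is* the state.
        self.running = self.cp.vms_for(self.name)

cp = ClusterControlPlane()
cp.assign("vm-a", "host-1")
cp.assign("vm-b", "host-2")
h1 = StatelessHost("host-1", cp)
h1.boot()
assert h1.running == {"vm-a"}
# A reassignment made while the host was down can't leave it stale:
cp.assign("vm-a", "host-2")
h1.boot()
assert h1.running == set()
```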

This setup also significantly simplifies everything. Systems like ESXi or Linux require building a layered control plane on top of the OS’s control system. This layered system must interpret and respond to the actions of the underlying control plane. If it needs to stop or change something, it is at the mercy of the underlying controls.

Most of the time, this isn’t an issue because complex software, such as kubevirt or hostd, tries to reconcile conflicting control goals.

With AHV, that intermediary layer doesn’t exist.

This leads to the misunderstanding: AHV is stateless and can’t be directly compared to ESXi or Linux/KVM. Instead, you should compare it to ESXi + vCenter or Linux with something like OpenShift.

When you make this comparison, you realize that AHV provides a level of availability and control that’s unmatched.

For example, vCenter manages multiple clusters, so a single vCenter failure can impact workloads across all of them. This necessitates running numerous vCenters, which in turn leads to vCenter sprawl and increased operational overhead.

With AHV, the boundary is a cluster. Each cluster is isolated from the others.

Thanks to Prism Central, it’s more feasible to run multiple workloads on separate, isolated clusters.

Because the AHV control plane is clustered, its availability is tied to the availability of Nutanix clusters, unlike vCenter, which runs as a single VM.

All of this is possible because AHV’s stateless design allows the creation of a cluster control plane with a small team, achieving something once thought impossible.


Filed Under: ahv, nutanixist

 
