I recently wrote about how AHV is deeply misunderstood. What struck me while writing it is that ESXi and OpenShift are just as misunderstood as AHV.
ESXi is amazing software with a dedicated team and satisfied customers. Its architecture enables continuous operation of VMs even if hosts become disconnected, thanks to a local control plane on each host that manages its registered VMs independently, relying solely on storage. However, this local control plane maintains state, so if it fails or becomes unavailable, the host becomes unmanageable even though its VMs continue running. The control plane relies on user-space services, such as hostd and vpxa, which can fail for various reasons. This architectural design, while effective, has inherent limitations. Notably, this architecture serves as the foundation for every other commercial hypervisor except AHV. OpenShift faces similar issues, due to KubeVirt and kubectl.
I will now focus on the architecture rather than the products.
When you build a clustered system, you are creating a clustered control plane, and you have two design choices. One is to build on top of each host's local control plane; the other is to build directly on the data plane.
If you build on top of the local control plane, you face two challenges. The first is detecting any actions the local control plane takes on its own and reconciling them. The second is that there are plenty of scenarios in which a VM is running but the local control plane is down and can’t be recovered. It then becomes very tricky to determine whether a host is up or down, because, as the clustered control plane, you don’t know whether the VMs are running or not.
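To make the ambiguity concrete, here is a minimal sketch of how a clustered control plane might classify a host it can no longer reach. The names (`classify_host`, the two heartbeat flags) are hypothetical, not any real hypervisor API; the point is that a failed management heartbeat alone cannot distinguish "host dead" from "local control plane dead, VMs still running":

```python
from enum import Enum

class HostState(Enum):
    UP = "up"
    DOWN = "down"
    UNKNOWN = "unknown"  # unreachable, but VMs may still be running

def classify_host(mgmt_heartbeat_ok: bool, storage_heartbeat_ok: bool) -> HostState:
    """Classify a host from the clustered control plane's point of view.

    Hypothetical logic: a management-network heartbeat failure alone is
    not enough to declare the host down, because with a stateful local
    control plane the VMs can keep running while hostd/vpxa-style
    services are dead or unreachable.
    """
    if mgmt_heartbeat_ok:
        return HostState.UP
    if storage_heartbeat_ok:
        # Host is still touching shared storage, so VMs are likely
        # alive, but the local control plane cannot be queried.
        return HostState.UNKNOWN
    return HostState.DOWN
```

The `UNKNOWN` branch is exactly the tricky case described above: the cluster cannot safely restart the VMs elsewhere, because they may still be running on the unreachable host.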
A clustered control plane is almost always operating in a quasi split-brain mode: it hopes it knows just enough of the local state to make the right decisions, and it expects the local control plane not to make decisions that break it. Not being able to determine deterministically whether a host is up or down makes the system fragile. Why? Because the host, while “seemingly” down, can be up, and while it’s in this disconnected mode, the host can be modified. At the same time, the cluster state can also change. When the host rejoins the cluster, a human must reconcile a state that cannot be reconciled automatically. Although this happens infrequently, it can’t be ruled out, so all of this complexity exists solely to handle issues that arise from the basic guarantee of stateful hypervisors.
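The rejoin problem above can be sketched as a merge of two divergent views. This is a hypothetical illustration (the function name and state strings are invented, not any product's API): where the cluster and the returning host both changed a VM's state independently during the partition, there is no safe automatic winner, and the VM must be flagged for manual reconciliation.

```python
def reconcile(cluster_view: dict, host_view: dict) -> dict:
    """Merge per-VM state after a disconnected host rejoins.

    Keys are VM names; values are states such as "running" or
    "stopped". Changes known to only one side can be adopted; changes
    made on both sides while partitioned are unresolvable conflicts.
    """
    merged, conflicts = {}, []
    for vm in cluster_view.keys() | host_view.keys():
        c, h = cluster_view.get(vm), host_view.get(vm)
        if c == h or h is None:
            merged[vm] = c          # agreement, or cluster-only VM
        elif c is None:
            merged[vm] = h          # VM known only to the host
        else:
            conflicts.append(vm)    # divergent: needs a human
            merged[vm] = "conflict"
    return {"merged": merged, "conflicts": conflicts}
```

For example, if the cluster marked `vm1` stopped (and perhaps restarted it elsewhere) while the disconnected host kept `vm1` running locally, `reconcile` reports `vm1` as a conflict rather than silently picking a side.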
So, are clustered control planes good, and are stateful ones bad? No. That’s simplistic. However, if your system relies on a clustered service, such as HCI, or if the applications are clustered, or if you have modern workloads that require interaction with the underlying control plane to operate, a clustered control plane is necessary. And if it is essential, then the choices for building it and the implications of those choices matter.
