wrong tool

nutanixist 24: how nutanix made k8s persistent volume provisioning more reliable and available

October 29, 2025 by kostadis roussos

One of the core impedance mismatches between k8s control planes and compute control planes is how disks are attached and the constraints thereon.

Why does it matter?


Whereas with VMs, adding a disk is a relatively rare day 2 operation, in a k8s environment, attaching a disk is part of restarting a pod that failed.

In a previous post, I wrote about how the hypervisor’s host control plane prevents adding a disk to a VM while the VM is being moved.

And how that fundamentally affects the availability of the application that runs in kubernetes.

Now I want to talk about another challenge.

To create a virtual disk via the CSI, you must interact with the infrastructure control plane.

Now the performance, availability, and location of the infrastructure control plane matter.

With Nutanix, you can configure the CSI (Container Storage Interface) driver to communicate directly with Prism Element (PE). When you do that, our CSI provider provisions a virtual disk by interacting with the PE control plane running on the same physical host as the kubelet. What’s important is that if the VM is running, then the PE control plane is accessible because an endpoint exists on the same physical host.

If you do not use the Nutanix CSI in PE mode, the CSI provider must communicate with Prism Central (PC). This can lead to issues where the kubelet is unable to get a disk provisioned because it depends on an external system.

The VCF 9.0 product documentation includes an excellent illustration of this architecture.

This leads to an availability mismatch, which adds complexity. The external control plane must be more available than any host that creates a pod. The network must be designed to support that level of availability. While this is achievable, it introduces additional tradeoffs.
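To make the mismatch concrete, here is a back-of-the-envelope sketch in Python. The availability figures are made-up assumptions, not measurements of any product, and the function and variable names are mine. The only point is that dependencies in series multiply, so every external hop lowers the chance that a disk can be provisioned when a pod needs it.

```python
def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


# Hypothetical numbers, chosen only for illustration.
HOST = 0.999                    # the host running the kubelet
NETWORK = 0.999                 # network path to an external control plane
EXTERNAL_CONTROL_PLANE = 0.999  # the external control plane itself

# Local endpoint: if the host is up, the control plane endpoint is reachable.
local_mode = serial_availability(HOST)

# External control plane: host, network, and control plane must all be up.
external_mode = serial_availability(HOST, NETWORK, EXTERNAL_CONTROL_PLANE)

print(f"local endpoint:   {local_mode:.6f}")
print(f"external service: {external_mode:.6f}")
```

Under these assumptions the external path is always less available than the host alone, unless the network and the external control plane are effectively perfect. That is the design burden the paragraph above describes.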

What I like about the Nutanix platform is the choice it offers. And depending on the tradeoffs that matter for you, you can make different choices.

nutanixist 23: overcoming kubernetes and vm storage limitations

October 29, 2025 by kostadis roussos

Virtualization offers workload isolation and separates infrastructure and application teams. This separation allows operations such as vMotion to proceed without coordination, enabling host maintenance and infrastructure rebalancing to proceed seamlessly.

One notable limitation of Kubernetes (k8s) and virtual machines (VMs) is the interplay between pod deployment and persistent volumes. Platform engineers want the ability to deploy pods and create storage on demand quickly. However, the virtual machine abstraction complicates this, making pod deployment more challenging and negatively affecting application availability.

For instance, when a virtual machine has a single virtual disk and needs to attach another, this operation is blocked during mobility tasks with hypervisor-attached storage.

Now, in a traditional VM environment, that restriction isn’t too big a deal, because adding another virtual disk is not a typical day 2 operation.

But in k8s, whenever I deploy a new pod to a VM and want persistent disks, I have to add another virtual disk.

So now, whenever the infrastructure admin wants to perform a rebalancing or maintenance, they must coordinate with the platform engineering team or the application team.

The whole point of virtualization is to provide isolation, yet because of this behavior, you lose it.

So co-ordinate!

Except for one critical use case called “High-Availability,” where the VM is rebalanced both before and after a server failure. So when a server fails, if a pod fails, and your VMs are being rebalanced, your pod restart can hang for an indeterminate amount of time. And if it hangs, then your application either runs in a degraded mode or doesn’t run at all.

This limitation exists for all KVM-based hypervisors, to the extent I am aware, and for VMware hypervisor-attached storage.

Nutanix, however, offers another class of storage, called a “volume group,” that has been available for 5+ years and allows a guest to attach to a virtual disk via iSCSI.

Nutanix calls that a “guest attached” volume group.

There is a trade-off, of course, in using this iSCSI layer. The Nutanix CSI driver handles the details.

In a vSphere world, you could use iSCSI to an external storage array from the guest, which introduces another set of trade-offs. It also complicates the environment’s operations. vVols tried to make that better.

With Nutanix, the nice property of the volume group is that I can attach multiple virtual disks and apply data management policies to the volume group, such as snapshots and DR, so that as new disks are created, they inherit those policies.
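A toy model of that inheritance property, in Python. The class names, policy names, and methods here are hypothetical and are not the Nutanix API; the sketch only shows the idea that policies live on the group, so a disk added later is covered without anyone touching it.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualDisk:
    name: str
    size_gib: int


@dataclass
class VolumeGroup:
    """Toy model: data management policies are set on the group, not per disk."""
    name: str
    policies: set[str] = field(default_factory=set)
    disks: list[VirtualDisk] = field(default_factory=list)

    def add_disk(self, name: str, size_gib: int) -> VirtualDisk:
        disk = VirtualDisk(name, size_gib)
        self.disks.append(disk)
        return disk

    def policies_for(self, disk: VirtualDisk) -> set[str]:
        # A disk is covered by whatever the group is covered by.
        if disk not in self.disks:
            raise ValueError(f"{disk.name} is not in {self.name}")
        return set(self.policies)


vg = VolumeGroup("app-data", policies={"hourly-snapshot", "dr-replication"})
first = vg.add_disk("pvc-1", 100)
later = vg.add_disk("pvc-2", 50)  # created long after the policies were set

print(sorted(vg.policies_for(later)))
```

The disk created later reports the same policies as the first one, which is the property I care about: nobody has to remember to protect the new disk.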

And so I get the simplicity and flexibility of virtual disks, without any of the day-2 headaches of hypervisor-attached storage.

nutanixist 22: the impact of system consistency in ahv vs esx and other systems

September 27, 2025 by kostadis roussos

AHV prioritizes system consistency over workloads, whereas ESX and every other OS prioritize workloads over system consistency.


If you examine the fundamental difference between AHV and ESX, once you set aside the features, APIs, and opinions, the most basic question is: “When is the host down?”

ESX asserts that as long as the kernel is running, the host remains up because a workload may be running or about to start. Even if the kernel is unreachable from the outside, ESX continues to run. The only person who can decide the host is down, therefore, is a human.

AHV, on the other hand, believes that once it is no longer part of the quorum, the host is down.

Both approaches have value, but they yield different outcomes.
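The two answers to “When is the host down?” can be written as two tiny predicates. This is a deliberately simplified Python sketch with hypothetical names. It is not how either AHV or ESX is implemented; it only contrasts who gets to make the call.

```python
def host_is_down_quorum(host: str, quorum_members: set[str]) -> bool:
    """Quorum rule: a host that is not part of the quorum is down."""
    return host not in quorum_members


def host_is_down_operator(host: str, declared_down: set[str]) -> bool:
    """Operator rule: a host is up until a human declares it down."""
    return host in declared_down


quorum = {"host-a", "host-b"}    # host-c has dropped out of the quorum
declared_down: set[str] = set()  # no human has made a call yet

print(host_is_down_quorum("host-c", quorum))           # True
print(host_is_down_operator("host-c", declared_down))  # False
```

In the first model, software can act on the answer immediately. In the second, every other component has to keep going until a person updates `declared_down`.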

With ESX, the human has complete control over deciding when to restart a host. Because only the human knows whether the host is running, each additional component must continue functioning until instructed otherwise and must keep operating even if other system parts are down.

It’s why, for example, with vSphere HA, even if the network is partitioned, all hosts will run workloads.

Until the human indicates that ESX is down, all system components should assume a workload is either running on the ESX host or may start running, so they must try to keep running as well.

The difficulty is that a malfunctioning piece of software can appear just like one that is very slow.

Therefore, each layer advances without knowing if another will do the same later, which can result in incorrect decisions.

A trivial way to prove this is with backup and restore. When you restore a system from a backup, to the outside observer, that’s indistinguishable from a very slow system. The ancient system must now catch up with the current state of the world. To do that, it has to be able to read the current state, but there is no precise current state. So at some point, a human must be involved to resolve inconsistencies. It’s why restoring VCF is so painful.

The benefit of this approach is that it allows surprisingly fast delivery of components, as long as integration and consistency are less critical than the speed of feature delivery for each one.

The downside is that when two systems need to agree on the system’s status, they cannot. Because only a human knows if the system is up, down, or slow, any software trying to coordinate between two components can only make an educated guess about what’s happening.

To handle this, you need to invest in more tools, monitoring, and observability. But it’s always a guess.

The alternative approach of AHV has the property that the software systems are aware when the host is down, since the computer makes the decision independently of whether workloads are running on the AHV host.

More importantly, any workload on that AHV host will not run until the cluster control plane reinstalls it on the host.

As a result, any layered system knows exactly when to stop.

Consequently, all layered systems are aware of each other’s state.

And all parts of the system agree on the state of the workloads.

The upside of this approach is that the system is correct, scales better, and is simpler to operate and develop against. The downside is that until the quorum system is more reliable than a single kernel, your system is less reliable.

What AHV has done is make its clustered system as reliable as a single kernel. And that is an astonishing achievement.

Once that is achieved, and if overall system behavior is more important than any one single system, then the simplicity of the AHV approach allows for faster feature delivery because integration is simple.

the nutanixist 20: how to build an AZ using soft transactions, a clustered IO path, and a stateless hypervisor without a hyperscaler cloud network

September 21, 2025 by kostadis roussos

I’ve been pondering the problem of making infrastructure transactional for 20 years.

The one paper I wrote (https://www.usenix.org/legacy/event/lisa07/tech/full_papers/holl/holl.pdf) is an early attempt at getting desired state systems to work.

You can read the paper, but the critical idea (and it’s an ancient one) was that you take all of the control plane code and put it in the central system.

The problem with that approach (and why the product failed) is availability.

The thing we built had the nice property of simplicity of management. It had the unfortunate property of being less available than what it tried to replace. What do I mean? Our solution required a single centralized control plane. If that control plane failed, then snapshots, mirrors, and backups failed. Without our control plane, each NetApp Filer managed its own schedule and failed independently.

Storage administrators barfed all over it. They rejected the product and the architecture.

Then I went to Zynga. And there I took another stab at the problem of managing systems at scale. And there we built some pretty slick management software that allowed Zynga to scale to 100 million MAU for Cityville, on what was basically the flakiest infrastructure I have ever used. The critical insight I had at Zynga was that since transactional systems at scale didn’t work with a centralized database, you needed to build something that relied on eventual consistency.

Then I came to VMware and decided to tackle the problem of deterministic infrastructure at scale again. That’s when I realized there wasn’t really a solution to my problem.


What was my problem:

I had several hundred distributed databases (one per cluster), and I wanted to manage particular semantics that didn’t quite fit into a cluster’s semantics. For example, networking spans clusters.

And I failed to come up with an answer.

What do I mean? The current system required manual intervention to keep running. The new eventually consistent system also required manual intervention to keep running because it wasn’t deterministic.

So what was the win? Unclear. But there was a win around per-cluster state, and so we decided to solve that. Working with Brian Oki, who did most of the heavy lifting, we devised a plan to make forward progress. We decided to push the cluster state into the cluster.

We began working on an internal project called Bauhaus, despite not having a definitive answer on how to approach networking. Bauhaus was about moving some of the cluster state into the cluster using a distributed KV store to simplify recovery and improve resiliency.

The critical insight I didn’t have was “AZ”

An AZ is one of those concepts that practitioners of distributed systems have spectacularly failed to define, and it is the most fluid of all.

Ask 50 practitioners and you get 50 answers.

And because of that, it’s too amorphous to build systems with.

But there is a crucial insight about an AZ:

An AZ is a control plane that, when it fails, the hardware it manages becomes unusable, even if the hardware is powered on.

An AZ from the outside observer’s perspective is one thing.

But the critical activity in cloud engineering is “how do I build an AZ so it appears to be one thing, but is actually built from many things.”

The thing that’s not obvious to folks who don’t spend too much time puzzling over this problem is how the network is built in the cloud.

If you examine the clouds, the critical aspect of their systems is a highly redundant, high-bandwidth network between and within data centers.

Every cloud has its own proprietary networking stack, which, when you interact with it (from the underlay, not the overlay), requires a significant amount of bridging magic. Those underlay networks do not have all of the semantics or properties of traditional IP networks.

It’s the existence of those networks that allows for the cloud to provide a transactional system behavior.

So let me be precise:

In the cloud, I can assert that if I can’t reach a node, the node is down.

If I can’t reach the AZ, it’s down.

And if a VM was created in AZ 1, it’s either running in AZ 1 or not running in AZ 1. It cannot exist outside of AZ 1.

Without the cloud networks and the fact that every part of the system was engineered around this principle, building an AZ-like construct on premises was very difficult without extensive investment in network and hardware design.

What these Nutanix guys did is figure out how to work around this using a custom data path and soft transactions.

Rather than relying on the network connectivity to determine if a VM is running or not, they used the IO data path and a stateless OS.

The IO data path guarantees that any hypervisor that boots cannot access any state that the clustered control plane doesn’t want it to access.

The stateless OS allows the cluster control plane to program the OS to its new state trivially.

The existence of a clustered IO path and a stateless hypervisor allows the cluster to control what state is being modified and which workloads are running. In effect, the clustered I/O path and stateless hypervisor enable the cluster as a whole to operate as a single entity.

As I mentioned earlier, soft transactions and a distributed database are what enable this scalability.

In this incredibly long and complex journey, I was fortunate to work with some brilliant people, but a critical person was Dahlia Malkhi, who, when I hit a brick wall, made it possible for me to see the path around it. I call her out because she was a researcher, and we may have interacted on a technical topic 2 or 3 times, and each time was seminal.

the nutanixist 21: architecture is why Pure and nutanix could deliver a great solution in record time

September 15, 2025 by kostadis roussos

One of Pure Storage’s more remarkable achievements was its integration with vVOLs. They spent years integrating and making vVOLs work. And, without a doubt, a significant part of the reason the Nutanix integration was easy was the work the Pure team did.


But this is why it was easy for Nutanix and Pure.

vVOLs were VMware’s answer to the question of how to let storage vendors provide VM data management that was integrated into VMware’s policy-based management framework.

That’s a mouthful.

In 2008, NetApp introduced a product called SnapManager for Virtual Infrastructure, which revolutionized how people discussed storage integration with VMware. Instead of seeing storage as independent of compute, it was presented as an integrated operational workflow. The VI admin using SMVI could directly integrate with NetApp storage to take snapshots and provision storage.

In 2011, Nutanix introduced HCI, which provided VM-level data management that bypassed the operational concerns of storage administrators by removing them from the equation entirely.

In 2012, VMware introduced policy-based storage management, followed in 2015 by the first incarnation of vVOLs, to enable policy-based management of storage.

What VMware aimed to do was enable the entire storage ecosystem to integrate with the vSphere control plane, providing the operational value of VM data management in a consistent, vendor-agnostic manner.

Effectively, the goal was to move around competitors like Nutanix and NetApp by commoditizing the VM data management. And make vSphere the way you manage data, with storage vendors acting like providers.

It was a good idea, and it was on the cusp of greatness, but for what I can only imagine were misguided, petty reasons, VMware canceled it.

Many of the challenges of vVOL were inherent to vSphere, making integration very difficult.

vSphere doesn’t have a cluster control plane, and VMFS does not have a single control point for I/O; the VMFS IO path is in the kernel.

So, what were vVOLs? Without getting too deep into the weeds, what VMware did for vSAN was add a new path in the core storage stack of vSphere. That layer was then integrated into the vSAN cluster control plane. That same interface was then externalized to the partners.

And that was the problem.

The storage partner was tasked with the complicated problem of building a clustered storage control plane. Why? Because vSphere, as I have explained elsewhere, doesn’t have a clustered control plane and allows independent hosts to make independent decisions that the control plane must react to.

When vMotion occurred, the VASA provider was involved in the operation as it had to unmount a LUN from an ESXi host and then remount it on another host.

But it was messier. Because vSphere cannot guarantee the number of hosts that will connect to a storage array or the number of LUNs that will be mapped, the VASA controller must manage any limits.

And then, finally, due to VMFS limits, the number of vVols that could be connected to any host was limited.

For Nutanix, these problems didn’t exist. Due to the clustered control plane, we could ensure that the number of LUNs connected to a storage controller remained within the limits agreed upon by Pure and Nutanix. Because our IO path was in user space, we could mount and map every virtual disk on every host. And because of our clustered control plane, during a Live Migration (Nutanix’s vMotion), we could handle the re-routing of the IOs without requiring the external storage provider to do the fencing for us.

Unlike vVOL, which requires the storage vendor to build a clustered control plane from the basic primitives of the per-node file system, with Nutanix the storage vendor integrates with an existing cluster control plane and operates on cluster-level semantics.

That is why our integration was fast.

And more importantly, why our integration delivers more value and better availability than the dearly departed vVOLs.

the architecturalist 63: nutanix was the correct answer

September 13, 2025 by kostadis roussos

In 2012, while at Zynga, I had a moment of clarity that the way we had thought about infrastructure up to that point was wrong. That our focus on making a single node more and more available was a dead end.

I wrote about this on Quora, and it was picked up by Forbes, which gave me 1 minute of fame.

And I wrote this:


NetApp’s engineering spent a lot of time worrying about hardware availability and making hardware appear to be much more resilient than it actually was.

And yet, these guys like Facebook, Twitter, and Google didn’t think that was important.

Which was mind-boggling. How else can you write software if the infrastructure isn’t perfect? “What were you people doing?” I thought.

So what drove me to find another job was that somehow, people were building meaningful applications that didn’t need component level availability. Something was changing…

Which brings me to what was changing.

What was changing, and this only became obvious after I joined Zynga, was that the old model was dead.

In a world where you have thousands of servers, depend on services that change all of the time, the notion that the application can be provided the illusion of perfect availability is, well, foolish.

In fact, applications have to be architected to understand failures. Failures are now as important to software as thinking about CPU and Memory and Storage. Your application has to be aware of how things fail and respond to those failures intelligently.

I believe that the next generation of software systems will be built around how do you reason about failure, just like the last was about how do you reason about CPU and Memory and Storage.

For the last 13 years, I have been wondering what the correct answer is. One school of thought believed that the correct answer was to treat everything as a database transaction. What if we made infrastructure transactional?

As a result, numerous attempts were made to develop management applications that updated the model of the world in a database and attempted to force the real world to conform to that model. I even invented one and published a paper that described such a system.

And they kind of worked.

The general idea was that you had an API that updated a database, and then a set of controllers that would go and modify the world to conform to the database. And if they ever detected an inconsistency between the world and the database, they would go and correct the system to conform to the database.
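A minimal Python sketch of that pattern, assuming the database and the world can each be modeled as a dictionary of VM name to power state. The function and variable names are hypothetical; a real controller would call out to the managed system rather than assign into a dictionary.

```python
def reconcile(desired: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Push the world toward the database and report what was changed."""
    corrections = []
    for vm, want in desired.items():
        have = actual.get(vm)
        if have != want:
            actual[vm] = want  # stand-in for calling the real system
            corrections.append(f"{vm}: {have} -> {want}")
    return corrections


database = {"vm-1": "on", "vm-2": "off"}  # the model, updated via the API
world = {"vm-1": "off", "vm-2": "off"}    # what is actually running

print(reconcile(database, world))  # ['vm-1: off -> on']
print(reconcile(database, world))  # []
```

The second call finds nothing to fix, but only at that instant. Nothing in this design stops `world` from being changed by someone else between two runs of the controller.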

And those systems failed to deliver on transactional infrastructure.

When you invoke an API, the database gets updated, and the world converges, but here’s the rub: the world can diverge. And you wouldn’t know.

Let me provide an example from vCenter, a product with which I am very familiar.

Let me be specific – you tell vCenter to power on a VM. vCenter updates its database, then communicates with ESXi, and the VM is powered on.

But is the VM powered on?

You don’t know, because a user can log into ESXi and power off the VM.

In effect, ESXi has its own database and API. And that API and database can be used to change the state of the system.

To make matters worse, if a network partition occurs, the VM may well be powered on, but vCenter cannot determine whether it is.

Therefore, any piece of code written must account for three states: “Yes, No, and I don’t know.”

Now, if it’s only one client calling vCenter and doing one thing at a time, that’s manageable. However, if you are working with workflows that depend on the VM being powered on, for example, powering on the VM, moving it, and so on, then for every step, you must account for the possibilities of ‘yes’, ‘no’, and ‘maybe’. And handling all the different kinds of ‘maybe’ makes writing the control plane tricky.
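Here is what the three-state answer looks like in code. This is a sketch in Python with names I made up; it is not the vCenter API. The point is that every caller has to carry a third branch.

```python
from enum import Enum


class PowerState(Enum):
    YES = "yes"
    NO = "no"
    UNKNOWN = "unknown"  # for example, the manager cannot reach the host


def next_step(state: PowerState) -> str:
    """Every workflow step has to branch on all three answers."""
    if state is PowerState.YES:
        return "continue"
    if state is PowerState.NO:
        return "power on, then continue"
    return "wait, retry, or escalate to a human"


for state in PowerState:
    print(state.value, "->", next_step(state))
```

With a single step this is manageable. In a workflow of k dependent steps, each of which can come back with any of the three answers, there are up to 3**k combinations of outcomes to reason about.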

And when I was at Zynga, I would like to believe I had identified this problem, but I had no idea how to solve it.

For years, I thought the only path forward was the desired state. In short, you express an intent, and the system converges to that intent. But the problem with that model is that expressing things as a sequence of operations is more convenient than simply describing intent. The problem with intent is that if you need to express two different contingent intents, how do you do that? And yes, you could, but pretty soon, you have one massive intent that describes the entire universe.

And so the approach, although promising, never materialized.

And then I ended up at Nutanix. I have also noted that Nutanix has a distributed database at its core, which is part of the puzzle. However, as I mentioned earlier, it’s only a part.

There were four more.

The second was the ability to have a parent database with multiple child databases, and that the parent database would always receive updates in the correct write order.

The third was soft transactions. This is critical because the system must perform reliably and be able to tolerate failures.

But the piece of the puzzle that eluded me was the need for two magical pieces of technology: the first was AHV, a stateless operating system, and the other was Stargate, a clustered IO path.

What Stargate guarantees is that the cluster knows which disk is being connected to, and it provides a point of control for the disk. It is not possible to change the state without Stargate knowing. And so, for a cluster, Stargate can prevent anyone from accessing disks and assert who is accessing them.

AHV, in turn, doesn’t remember after a reboot what it was doing before. Therefore, AHV cannot run any workload without the cluster knowing what the workload is.
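A toy model of those two properties, in Python. The class and method names are hypothetical, and this is in no way how Stargate or AHV is implemented. It only shows the shape of the guarantee: disk access is checked against cluster state, and a rebooted host knows nothing until the cluster tells it.

```python
class Cluster:
    """Toy stand-in for the cluster control plane's view of the world."""

    def __init__(self) -> None:
        self.disk_owner: dict[str, str] = {}        # disk -> host allowed to use it
        self.placements: dict[str, list[str]] = {}  # host -> workloads it should run

    def open_disk(self, host: str, disk: str) -> bool:
        # Clustered IO path: every access is checked against cluster state.
        return self.disk_owner.get(disk) == host


class StatelessHost:
    """A host that keeps no memory of its workloads across a reboot."""

    def __init__(self, name: str, cluster: Cluster) -> None:
        self.name = name
        self.cluster = cluster
        self.running: list[str] = []

    def reboot(self) -> None:
        self.running = []  # nothing local survives

    def sync_from_cluster(self) -> None:
        # The only way to get workloads back is to be told by the cluster.
        self.running = list(self.cluster.placements.get(self.name, []))


cluster = Cluster()
cluster.disk_owner["disk-1"] = "host-a"
cluster.placements["host-a"] = ["vm-1"]

host = StatelessHost("host-a", cluster)
host.sync_from_cluster()

# The cluster moves vm-1 and its disk to host-b while host-a is rebooting.
cluster.placements = {"host-b": ["vm-1"]}
cluster.disk_owner["disk-1"] = "host-b"
host.reboot()
host.sync_from_cluster()

print(host.running)                           # []
print(cluster.open_disk("host-a", "disk-1"))  # False
print(cluster.open_disk("host-b", "disk-1"))  # True
```

When host-a comes back, it runs nothing and can touch nothing, so there is no window in which two hosts both believe they own vm-1.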

When you combine all five pieces of technology, you have the answer to the question I posed.

The infrastructure, by design of the datapath and system components, only has two answers to any operation: “Yes, I completed” and “No, I didn’t.” And either is definitive. There exists no other possible answer to the question.

Once you have such a system, it becomes possible to implement two services that control the OS and the datapath that can assume the behavior of the infrastructure is binary.

And once you do that, you can build a system of APIs that always return yes or no to any question.

This then allows you to combine APIs into workflows that can be trivially designed. What do I mean?

Suppose I have a workflow that must call 5 APIs. We model this as a single workflow comprising five tasks.

In transactional infrastructure, after each API returns a response, I know what the environment must be. And therefore, if it says “Yes”, I can advance to the next step knowing that it is “Yes.” In other words, if Task 1 is completed successfully, I can easily advance to Task 2.
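When every API answers a definitive yes or no, the workflow engine is almost embarrassingly small. A minimal Python sketch, with hypothetical task names and a dictionary standing in for the infrastructure:

```python
from typing import Callable

# True means "Yes, I completed"; False means "No, I didn't".
Task = Callable[[], bool]


def run_workflow(tasks: list[tuple[str, Task]]) -> tuple[bool, list[str]]:
    """Run tasks in order and stop at the first definitive 'No'."""
    completed = []
    for name, task in tasks:
        if not task():
            return False, completed
        completed.append(name)
    return True, completed


state = {"powered_on": False, "network": False}


def power_on() -> bool:
    state["powered_on"] = True
    return True


def attach_network() -> bool:
    if not state["powered_on"]:
        return False
    state["network"] = True
    return True


ok, done = run_workflow([("power on VM", power_on), ("attach network", attach_network)])
print(ok, done)  # True ['power on VM', 'attach network']
```

There is no third branch and no retry logic, because a “Yes” from one task is still true when the next task starts. That only holds if nothing outside the control plane can change `state`.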

So let’s consider the alternative. Task 1 is to power on a VM. Task 2 is to attach a network to the VM. If Task 1 declares success, Task 2 might fail because someone behind the scenes shut down the VM. Now, Task 2 must handle an error. But what does this mean for the workflow? Did the workflow fail? Well, it didn’t. What happened was that the environment changed in a way that the workflow was unaware of.

So let’s look at the workflow state –
Task 1 – power on VM – success
Task 2 – Attach Network – Failure because the VM is not powered on.

This is a contradiction. How could Task 1 succeed and Task 2 fail? It arises because the workflow didn’t account for another system changing the state of the VM behind the scenes. And because the change occurred outside of the system, the program interacting with the APIs cannot determine why it has a contradiction.

To understand what happened, you need to build yet another system that monitors both the workflow and the system that can be changed outside the workflow’s control.

Intent-based systems attempted to work around this by retrying, but, as I mentioned, they had their own issues, the most significant being an infinite retry loop.

Ultimately, the only solution was to make it impossible for the system to be changed by anything that was not under the control of the control plane.

And that’s what the folks at Nutanix did.

the nutanixist 19: the arrogance of the broadcom shift in cloud credits

September 8, 2025 by kostadis roussos

I usually don’t discuss business models, but what Broadcom did is a good example of how thinking you have an irreplaceable product and not understanding your customers can cause problems.

One of the main challenges with VMware was the variation in business models, which made license portability difficult.

That variation also led engineering teams to go to great lengths to avoid collaborating.

In many ways, VMware was like several companies in one, each with its own distinct business model, selling layered products on top of vSphere.

Changing this corporate structure has been the main driver behind Broadcom’s changes to VMware’s product setup.

At the same time, we live in a very complex world. Corporations have very complicated budgets.

The core of selling goods is to meet your customers where they are.

Broadcom’s goal is to remove customer choice. The idea is that by forcing customers to work with the team that owns VCF credits, workloads will be pushed back into on-premises environments or not moved to the cloud.

Here’s how it works:

Think about “Big Corp Co.” with two teams. Team A wants to run some workloads in the cloud, while Team B is responsible for on-premises virtualization. Team A has workloads on vSphere that need to be migrated to the cloud, and the most cost-effective way to do this is to utilize some of their corporate credits.

But now, Team A can’t use those credits.

Since running the stack on VCF requires VCF credits, they will be directed to the internal VCF team, Team B, which will tell them that instead of running in the cloud, they should run on-premises.

Team A might protest, but Team B, which controls the budget, will explain that there isn’t enough budget for the cloud deployment and offer an on-premises alternative.

Therefore, Team A, because they are forced to use VCF for certain reasons, will theoretically have to move any workload that could have run in the public cloud back on-premises.

This approach assumes Team A has no options.

And that’s where this model of limiting choices fails.

Customers always have options.

And when you force someone to do something, they will quickly find ways to choose differently.

It’s why Nutanix has been adding many customers lately.

the nutanixist 18: relying on infrastructure instead of application specific availability

September 7, 2025 by kostadis roussos

From 1996 to 2009, it was believed that application availability was an infrastructure issue, so improving reliability meant making infrastructure more resilient. Tandem NonStop symbolized the ideal of reliable infrastructure, with systems like the Origin 2000, featuring a single system image and NUMA, representing the peak of scalable computing. However, in the mid-1990s, Gregory Pfister’s book “In Search of Clusters” argued that building such infrastructure was too complex, advocating instead for clustering.

At the time, this idea seemed absurd. When distributed systems were first being deployed in the mid-1990s, it seemed truly crazy.

As a result, infrastructure vendors continued to focus on making single systems more resilient.

When the cloud emerged, infrastructure architects like myself viewed it skeptically because of its lack of guaranteed availability. “How would applications run on it?” we wondered.

What we didn’t realize was that software naturally seeks to operate on cheaper hardware, and because of this, new technologies have arisen to make that easier.

For me, the pivotal moment came in 2009, when Cafeville was running on an effectively 1000-node cluster. The team combined various components with some critical innovations.

This marked the beginning of an era where availability shifted from being purely an infrastructure concern to an application problem because infrastructure became less reliable.

My critique of vSphere and similar systems that aren’t natively clustered is that they are inherently less reliable than what applications require. Consequently, application teams must write code assuming infrastructure instability rather than depending on the system’s reliability.

What do I mean by infrastructure instability?

In the pre-Cloud era, infrastructure was assumed to either work or fail. In the cloud era, uncertainty in infrastructure was acceptable if the application, its developers, and the operations team could identify what went wrong—uncertainty was tolerated.

The problem was that this increased the cost of maintaining and supporting applications and slowed down development, as teams spent more time on infrastructure issues than on the applications themselves.

At Zynga, when my team provided a reliable infrastructure, team sizes decreased, and productivity for the game teams increased.

Our team ensured there would be no ambiguity about how the infrastructure was performing. We provided guarantees.

By stating that infrastructure needs to be more robust, I mean that it must ensure the application and its components are operational, that data is available, that no infrastructure changes have occurred, and that the system can recover if needed from a backup without requiring the system to be rebuilt.

And in an era of clustered applications, clustered infrastructure that gives those guarantees by default, like Nutanix, is the only way forward.

the architecturalist 62: people develop tools, software is a means not an end

August 22, 2025 by kostadis roussos

In 1994, I was told by a visionary professor of Computer Science that I was a fool for going into CS because the combination of component software design and offshoring was going to eliminate jobs.

I remember going pale in the face and sticking with it. At the time I graduated, there were 13 CS graduates, of whom two had cross-disciplinary fields. That class had the guy who invented Hadoop, and the folks who invented DTrace, and me (yes, I am putting myself in the same breath, but that’s because we graduated at the same time).

Thirty-one years later, I see the same kind of fear-mongering.

The notion that computers will do software engineering or that there is a finite demand for engineered products remains the dumbest and most ignorant take in the history of takes.

AI is just the latest iteration in making each unit of software we write more efficient. In the 1980s, it was the move from assembly. In the 1990s, it was the move to garbage-collected programming languages. In the 2000s, it was the emergence of databases, hypervisors, and the web. In 2008, it was the emergence of public cloud.

Does that mean that there aren’t dislocations and changes? No. In fact, in those transformations, jobs stopped existing, and folks had to retrain. And some of it was unfun.

But the idea that tool-making, design, and construction don’t require human beings is the fevered dream of AI advocates.

the nutanixist 17: the consistent nutanix cloud platform

August 18, 2025 by kostadis roussos

In an earlier post, I observed how VCF’s state was like the cat in Schrödinger’s box. It was impossible to determine the state because each observer had a different view of it, and there was no consensus protocol.

Nutanix’s engineering team took a fundamentally better approach than VCF.

As always, read the Nutanix Bible (https://nutanixbible.com) for more details.

Both the Nutanix Cloud Platform and VCF have the same problem: how do you share state in a distributed system?

In particular, given a set of programs in a distributed system, they must all agree on the value of any shared state. If they don’t agree and don’t know that they don’t agree, then each program will make different decisions based on its view of the value. For more, see https://lnkd.in/gVHePnME.

Where it gets gnarly is when you have failures, because for everyone to agree on the value, everyone has to see the same updates to the value in the same order (there are variations on this requirement).

For example, suppose I have three databases, and each database has a copy of my bank account.

My starting balance is 100$. I deposit 100$ and then withdraw 150$.

With no consensus protocol, the following is possible:

Database 1 thinks I have 50$
Database 2 adds an overdraft charge because it saw the withdrawal of 150$ before it saw the deposit of 100$
Database 3 thinks I have a balance of 200$ because it never saw the withdrawal

With a consensus protocol, there is only one possible outcome, namely that each database thinks I have 50$ in my bank account.
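The bank account example can be replayed in a few lines of Python. This is only a sketch to make the ordering problem concrete: the `Replica` class, the `replay` helper, and the 25$ overdraft fee are all made up for illustration. Each replica starts at 100$ and applies whatever operations it happens to see, in the order it sees them. The overdraft case arises when a replica applies the withdrawal before the deposit.

```python
from dataclasses import dataclass

OVERDRAFT_FEE = 25  # hypothetical fee; the example does not specify one


@dataclass
class Replica:
    name: str
    balance: int = 100  # starting balance from the example
    fees: int = 0

    def apply(self, op: str, amount: int) -> None:
        if op == "deposit":
            self.balance += amount
        elif op == "withdraw":
            self.balance -= amount
            if self.balance < 0:
                # This replica believes the account is overdrawn.
                self.balance -= OVERDRAFT_FEE
                self.fees += OVERDRAFT_FEE
        else:
            raise ValueError(f"unknown operation: {op}")


def replay(name, ops):
    """Build a fresh replica and apply ops in the order given."""
    replica = Replica(name)
    for op, amount in ops:
        replica.apply(op, amount)
    return replica


DEPOSIT = ("deposit", 100)
WITHDRAW = ("withdraw", 150)

# No consensus: each replica sees its own subset and order of updates.
uncoordinated = [
    replay("db1", [DEPOSIT, WITHDRAW]),  # correct order
    replay("db2", [WITHDRAW, DEPOSIT]),  # withdrawal seen first
    replay("db3", [DEPOSIT]),            # withdrawal never seen
]

# Consensus: every replica applies the same agreed log in the same order.
agreed_log = [DEPOSIT, WITHDRAW]
coordinated = [replay(name, agreed_log) for name in ("db1", "db2", "db3")]

for r in uncoordinated + coordinated:
    print(r.name, r.balance, r.fees)
```

Without an agreed order the three replicas end up with three different balances. With one, they all end at 50$ and nobody charges a fee.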

Both VCF and NCP are distributed systems. VCF has a set of central databases (NSX db, vCenter Postgres, Operations) and a set of edge databases on ESX hosts. NCP has a single centralized database and a set of edge databases in the form of clusters.

I already discussed VCF, so today, let’s focus on NCP.

So how does Nutanix maintain consensus between the central database and the clusters?

Each cluster database notifies the central database of all updates in the cluster in the order they were made.

As a result, the central system always has a complete and consistent view of the state of the cluster.

And all of the products built on the central system have a single, consistent view of the state of each other and the clusters.
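To make "in the order they were made" concrete, here is a minimal Python sketch of one common way to get that property: the edge stamps each update with a sequence number, and the central side refuses to apply update N+1 until it has applied update N. This is an assumption about the general technique, not a description of how Prism Central or the clusters actually implement it. `ClusterLog`, `CentralView`, and the VM state strings are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterLog:
    """Edge side: stamps every local update with an increasing sequence number."""
    cluster_id: str
    next_seq: int = 1

    def record(self, key, value):
        update = (self.cluster_id, self.next_seq, key, value)
        self.next_seq += 1
        return update


@dataclass
class CentralView:
    """Central side: applies each cluster's updates strictly in sequence order."""
    state: dict = field(default_factory=dict)
    applied: dict = field(default_factory=dict)  # cluster_id -> last applied seq
    pending: dict = field(default_factory=dict)  # (cluster_id, seq) -> (key, value)

    def receive(self, update):
        cluster_id, seq, key, value = update
        if seq <= self.applied.get(cluster_id, 0):
            return  # duplicate delivery, already applied
        self.pending[(cluster_id, seq)] = (key, value)
        self._drain(cluster_id)

    def _drain(self, cluster_id):
        # Apply buffered updates only while there is no gap in the sequence.
        nxt = self.applied.get(cluster_id, 0) + 1
        while (cluster_id, nxt) in self.pending:
            key, value = self.pending.pop((cluster_id, nxt))
            self.state[(cluster_id, key)] = value
            self.applied[cluster_id] = nxt
            nxt += 1


cluster = ClusterLog("cluster-a")
u1 = cluster.record("vm-1", "powered_on")
u2 = cluster.record("vm-1", "migrating")
u3 = cluster.record("vm-1", "powered_off")

central = CentralView()
for update in (u1, u3, u2):  # the network delivers u3 before u2
    central.receive(update)

print(central.state[("cluster-a", "vm-1")])
```

Because the central view only ever advances through states the cluster actually went through, a snapshot of it taken at any point corresponds to some real past state of the cluster.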

This doesn’t sound like much, but it’s why restore actually works on NCP.

When I restore Prism Central from a backup, it has a consistent view of every cluster. It is not possible (modulo a bug) that the backup will contain a state of the environment that never existed. Nor is it possible for different services to have a different view of the environment.

It’s why you can restore from backup and recover an environment, whereas with VCF, you must do a rebuild.

The problem with the VCF approach, however, is not just backup; it fundamentally affects scale and availability.

Why does this matter?

Obviously, because backup 🙂

But it also affects scale. And correctness.

The VCF system works because, although theoretically the databases are out of sync, the system is working very hard to keep them in sync. And so as long as changes to the environment occur less frequently than the time needed for each database to figure out what is going on, the system works.

So what? Doesn’t every consensus protocol impose some cost? Yes, but the VCF consensus protocol, such as it is, doesn’t guarantee that the state is consistent; it only says the state should be consistent. So if you scale the system incorrectly, instead of becoming slower, it will behave incorrectly.
