wrong tool

You are finite. Zathras is finite. This is wrong tool.


from the architecturalist and the nutanixist: Thank You!

November 29, 2025 by kostadis roussos

Happy Thanksgiving, Folks.

It’s an American Holiday, and like so many holidays, its origin and history are complicated.

And it is a moment to stop and say thank you! And it’s important to say Thank You for the important stuff. My wife and son, first and foremost. My health is fine. My extended family, starting with my sister and her family, and my father. And my cousins, aunts, uncles, and nephews. And the people I play Dungeons and Dragons with. The people who let me mentor them, and the people who mentored me. The amazing community of folks I have interacted with on this platform. Even when we disagree, I learn.

And if that were all I had to be thankful for, I would be blessed beyond measure.

And yet, there is one more thing I want to be thankful for.

When I left Broadcom, I thought that my chance to build a global private cloud had ended.

I landed at Nutanix, and really thought I was joining NetGear. An SMB company that had delusions of enterprise grandeur. But the folks I talked to were really sharp, and they had something special, and I liked the space. I figured I would help them move the ball forward, and then retire. Going from SMB to Global Enterprise Cloud is a 15-year journey, and that was outside of my then-desired planning horizon for working.

And that would have been enough to be thankful for. But I also found a superb culture that was welcoming, open, and focused on what can be done, not what can’t.

But it was even better.

It was as if I had joined Arista at the moment when market dynamics made their technology, business model, and customer focus far more valuable. For those who have been following my posts, you have been on the same journey of discovery that I have been on.

My first week, I walked around wondering, “Why the hell do these people even exist?”

For 9 years, I had made it my mission to give vSphere customers no reason to look anywhere other than vSphere.

And yet here they were. At first, I thought it was because Nutanix had just found parts of the market that VMware’s GTM had ignored.

But as time has passed, I realized what was really going on.

Nutanix has been on a 15-year mission to build the private cloud on a solid foundation of computer science fundamentals. A single set of consistent entities on a single global logical database that can scale to absurd numbers and is built on transactional infrastructure.

That took 15 years to build.

And when I dug into it, I realized Nutanix won, not because of GTM, but because they offered a unique product and platform. Precisely the kind of platform I imagined we should build at VMware, but I didn’t understand what that meant and missed critical details.

And so, at a point in my life where I thought that building the global private cloud was part of my past, it suddenly became part of my future.

I had unwittingly joined a company that had completed the 15-year journey.

And so when I was least expecting it, I got a second chance to finish a job I started in 2004.

Thank you, Nutanix

Which leads me to a postscript.

Some of my former colleagues wonder, “Did you just join the competitive marketing team? What have you done there?” and all I can say is, “I can’t wait to show you…”

Filed Under: Uncategorized

the architecturalist 64: steam and ibm show the value of staying close to your customer

November 18, 2025 by kostadis roussos

When I saw Steam announce the Steam Box, I did a double-take.

A buddy of mine, when I first brought up the Steam Box as a guided missile aimed at the Xbox and remarked that this was exactly what Microsoft should have done, said the Steam guys did it because they were gamers.

And it got me thinking.

For almost 20 years, Microsoft has been trying to own the gaming platform market as part of a broader goal to own the home. The Xbox came out of the era where folks thought Smart TVs were the future.


Microsoft tried to leverage its position as the dominant player in the desktop PC market to enter the home through the gaming console.

What Microsoft did was create an entirely separate gaming ecosystem centered on a platform they engineered. And they had orphaned all of those PC games as they chased the console crown.

PC games had remained tied to the desktop.

And there they remained.

Microsoft, at its core, is an OS company. And so the solution to winning this new market was to build a new OS with a new API, and have all the games in the world converted to it. And once they were converted to this new API, global domination would occur naturally.


Steam took a very different approach.

As a gaming company, they saw the problem of games differently. Gamers take a game and hack it until it becomes the game they want. They will spend days finding bizarre tweaks to speed up the game. There is a whole community of people creating new and better assets for games. There is an entire community devoted to porting games from dead platforms to modern ones.

In that context, Steam approached the problem differently and asked: “How do I hack a game so it can run on a handheld device?”

They couldn’t ask developers to redo their games. So they leveraged technologies and techniques they had invented for desktop PCs to support the wide variety of input devices PC gamers want to use in their games. They leveraged the large community of folks who had figured out how to tweak games to run on platforms the developers never imagined.

And using those two insights and some excellent hardware design, they did what seemed impossible: they created a usable handheld gaming device for PC games.

But what about Windows? Again, the Steam folks, leveraging their gaming heritage, didn’t let that daunt them. So they used the kinds of cross-platform technologies many gamers use and got it to work.

Does it work flawlessly?

No. But Steam knew its audience well. PC gamers are used to tweaking, fidgeting, and changing things. Why did they know them? Because they loved gamers.

The Steam Deck was the warning shot.

The Steam Box is the guided missile.

I worked at Zynga. And what I learned at Zynga is how much platforms hate games. Just look at how Facebook, Apple, and Microsoft turned what were viable gaming platforms into dust. Facebook and Apple are extorting unsustainable prices. The 30% haircut basically makes games unprofitable unless they are wildly successful or the platform is actually helping with distribution. And the Apple Store is awful. And Facebook actively suppressed its platform. Microsoft was determined to move gaming off of Windows onto the Xbox. Instead of making Windows better, they made it “acceptable” for gaming while trying to push gaming to the Xbox.

The Xbox is a fine gaming platform, but it’s restrictive in the kinds of games you can ship on it, and the costs for game developers to produce a game are high.

And so the world looked like this: there were gaming platforms like PS5 and Xbox that offered an exceptionally curated set of games, while the PC gaming market was left to fester on desktops and laptops, where the broadest and richest set of games lived. And the dominant platform for PC games was doing very little to improve PCs as a gaming platform because their real goal was to get every game to run on the Xbox.

And so while everyone else did what they could to kill gaming on their platform, Steam chugged along. They focused on making something that was great for game companies, and game developers and gamers. It’s 2025, and basically, I play Steam Games. If the game isn’t on Steam, it doesn’t exist.

So Steam looked at the problem and said, “What if I put the Steam games in the living room?”

So they focused on building that. They did it by figuring out how to package the game so it could be played as a console game in a box. And having solved it for the Steam Deck, the Steam Box was a snap. In fact, the software stack can run on a wide variety of hardware platforms.

Why? Because the gamer’s ethos is to do that.

And so, after so many years, Steam has brought PC games to the living room at a very low cost.

Something Microsoft has failed to do, after spending billions on the Xbox.

In many ways, this reminds me of IBM. IBM had spent the last 30+ years in the wilderness of technology, but focused relentlessly on taking care of its customers.

And when the movement away from the cloud began, they happened to make the most important acquisition of the last 30 years, namely Red Hat. With Red Hat, Big Blue has become, for a lot of companies, the “trusted advisor” for modern workloads, with OpenShift being positioned as the right way to do modern Kubernetes workloads.

Steam spent 20+ years being relentlessly focused on its customer base. That focus on a customer base others didn’t consider valuable has enabled them to secure a privileged position as a middleman. And in so doing, they have positioned themselves to take advantage of long-term technology trends. It’s a fantastic statement on why you should stay close to your customers and the dangers of pissing them off.

Filed Under: Architecturalist Papers

the nutanixist 27: the UI Fallacy at the heart of the control plane paradox – we built a database for the operator’s brain.

November 12, 2025 by kostadis roussos

I recently read some material on VCF 9.0, and it discussed how one of the major improvements of VCF 9.0 was that it created a single UI and thus a single control plane.

As a control and management plane architect, I found those discussions and proposals infuriating. The idea that the problem with an extensive, complex system’s operation was just about providing more convenient dials and knobs struck me as absurd.

I could not figure out why PMs, GMs, and vendors pushed this approach until I met a set of customers.

To me, as a software engineer, the control plane is the thing that reads from a database and updates the datapath. The part that updates the database, reads from it, and returns information to the user is the management plane.

But when I spoke to product operators while I was an architect at VMware, they said the control plane was the UI.

At first, I found that odd, but then upon deeper reflection, I realized that they were right from their point of view. They saw themselves as the control plane, controlling the system.

The database was in their heads, and they used the UI to configure the system.

That -aha- was profound because it highlights the foundational tension in IT and infrastructure: where is the boundary between the human control plane and the computer control plane?

My observation was that the more you can push to the computer, the more robust and reliable your infrastructure is. You can do more, react faster, and provide better reliability if the computer is in control.

Building a UI improves human productivity if you believe the gating productivity factor is the human. If you believe that the system is optimal and that the only path forward is to improve human productivity, improving the UI is the right answer.

I find the idea that we have the optimal infrastructure architecture absurd.

Saying to your business partner, “This is wrong!”, answering “I dunno” when they ask, “So what do we do?”, and then demanding that nothing be done is absurd. In the absence of any other option, you do what is ridiculous. You optimize what you can. Fixing the UIs was the best answer, because the other answers weren’t that much better.

For years, I didn’t know how to build the right control plane that eliminated UI workflows relying on a human to be the control plane. And like most things, the answer stared me in the face.

It’s so absurdly obvious, it hurts to say it: to build an automated control plane, the control plane must be able to control and configure an entity fully, AND when the entity cannot be controlled by the control plane, it must stop within a guaranteed, bounded time frame.
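One way to picture the AND clause is a lease: the host runs workloads only while the control plane keeps renewing its lease, and stops them on its own once the lease lapses. The sketch below is only an illustration under that assumption. `HostAgent`, its method names, and the 10-second bound are hypothetical and are not taken from any Nutanix implementation. Time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class HostAgent:
    """Hypothetical host-side agent that runs workloads only while it holds a lease."""
    lease_seconds: float                      # the guaranteed bound (illustrative value)
    lease_expires_at: float = 0.0
    workloads: set = field(default_factory=set)

    def renew(self, now: float) -> None:
        # Called each time the control plane successfully reaches the host.
        self.lease_expires_at = now + self.lease_seconds

    def start(self, name: str, now: float) -> bool:
        # Refuse to start anything the control plane cannot currently control.
        if now >= self.lease_expires_at:
            return False
        self.workloads.add(name)
        return True

    def tick(self, now: float) -> None:
        # Run periodically on the host: once the lease lapses, stop everything.
        if now >= self.lease_expires_at:
            self.workloads.clear()


agent = HostAgent(lease_seconds=10.0)
agent.renew(now=0.0)
agent.start("vm-1", now=1.0)
agent.tick(now=5.0)     # lease still valid: vm-1 keeps running
agent.tick(now=10.0)    # lease lapsed: the host stops vm-1 without human help
```

In this model the worst-case window in which the host runs without control is the lease length plus the interval between `tick` calls, which is what makes the step bounded.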

Because for non-Nutanix customers, such a control plane doesn’t exist, and manual steps are necessary to handle the AND clause, it is fair to say that a single UI is an essential step to improve productivity. But it is an incremental step. A tiny, incremental step.

Filed Under: Architecturalist Papers, nutanixist

the nutanixist 26: the unbounded step – how desired state systems create dual-database drift

November 10, 2025 by kostadis roussos

After I wrote about single databases and control planes, and their importance, folks observed that OpenShift and KubeVirt use a single database.

And what I realized was that there was a deeper point being made: the Nutanix system guarantees that the host is under the control system’s authority, or it will automatically stop running.

The complex challenge of ensuring that the host runs only what you want is the first step, but the next — making sure it runs only when you can control it — is even more fundamental.

If the host doesn’t stop running automatically, the VM is both running and not running for some period of time. During that time, there are effectively two databases.

Because manual intervention is necessary, a third database comes into play: the operations database that detects this condition, and the system that actually reboots the machine or tracks its reboot.

The Kubernetes desired state system allows for this manual intervention by essentially waiting for the state to converge.

Desired-state systems enable these manual procedures, but the drawback is that they permit an implicitly unbounded step. Another system must address that unbounded step.
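The unbounded step can be sketched as a toy reconcile loop. This is an illustration of the idea, not Kubernetes code: `Host`, `reconcile`, and the VM name are hypothetical. The point is that nothing in the loop limits how many passes return "waiting".

```python
from dataclasses import dataclass


@dataclass
class Host:
    reachable: bool
    running: set


def reconcile(desired: set, host: Host) -> str:
    """One pass of a hypothetical desired-state loop for a single host."""
    if not host.reachable:
        # The controller cannot observe or change the host, so it can only wait.
        # Nothing here bounds how many passes return "waiting".
        return "waiting"
    host.running = set(desired)
    return "converged"


desired = set()                                  # the database says nothing runs here
host = Host(reachable=False, running={"vm-1"})   # the host is still running vm-1

results = [reconcile(desired, host) for _ in range(3)]
# The database and the host disagree for as long as the host is unreachable.
drift = host.running - desired

host.reachable = True                            # e.g. after a manual reboot or repair
final = reconcile(desired, host)
```

While `drift` is non-empty, asking the database and asking the host gives two different answers, and only something outside the loop can end that period.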

What makes this -reboot- more complicated is that, for stateful services, rebooting the host is expensive. The service may experience downtime or degraded performance. So when the reboot will happen—or whether it will—is unknowable. So the control plane that wants to take action on the VM or machine in this quantum state will have to stall.

Because rebooting a stateful service is not free, relying on the control plane connectivity to make the decision is fraught with operational challenges. Do you trust the control plane’s correctness more than the database or the OS that the control plane is running on? In other words, is the connectivity failure due to the control plane itself, or something else?

More specifically, is the problem in Kubernetes? Or is the problem in the underlying host and VM? To reboot the host automatically, you have to believe that the control plane, in this case, Kubernetes, is more reliable than any single host.

Brian Oki, a colleague of mine at VMware and the author of the original Viewstamped Replication, made the point to me early on, and it took me a while to appreciate: guaranteeing that the host is only running when it is connected to the control plane is a tough, but essential, property.

Filed Under: Architecturalist Papers, nutanixist

the nutanixist 25: why nutanix’s single database approach eliminates SDDC orchestration

November 8, 2025 by kostadis roussos

Over the past 14 years, across three companies, I have been trying to figure out how to build a deterministic SDDC. An SDDC that, when you reconfigured it, stayed reconfigured and didn’t require any human intervention.

I failed. I had to come to Nutanix to see what I was missing.

What’s the problem?

To control an SDDC, you must interact with a software control system.

And what is a control system?

Logically, every software control system has three elements: an API, a database, and the control software.

The API/UI/CLI updates the database, and then the control system reads the database state and updates the datapath.
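As a rough sketch of those three elements, here is a toy control system in which the API only writes to the database and a separate control loop pushes database state into the datapath. The class and method names are made up for illustration, and both the "database" and the "datapath" are just in-memory dictionaries.

```python
class ControlSystem:
    """Toy control system: an API, a database, and control software."""

    def __init__(self):
        self.database = {}   # desired configuration, keyed by entity name
        self.datapath = {}   # what is actually programmed

    def api_set(self, entity: str, config: dict) -> None:
        # API/UI/CLI path: only updates the database.
        self.database[entity] = dict(config)

    def control_loop_once(self) -> list:
        # Control software: read database state and update the datapath.
        changed = []
        for entity, config in self.database.items():
            if self.datapath.get(entity) != config:
                self.datapath[entity] = dict(config)
                changed.append(entity)
        return changed


cs = ControlSystem()
cs.api_set("vm-1", {"cpus": 2})
first = cs.control_loop_once()    # ["vm-1"]: the datapath was updated
second = cs.control_loop_once()   # []: database and datapath already agree
```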

But what if you have multiple control systems?

And that’s where I failed.

See, the original idea I had was that you could solve the problem by controlling the control systems via their APIs.

We call that orchestration.

But that didn’t work, because:

1) Each control system has its own database, meaning it can get out of sync with the orchestration layer.

2) When an operation needs to be performed by multiple control systems on the same entity, drift is inevitable because the entity lives in more than one database.

The first point is obvious: the controller’s API can reconfigure the control system, bypassing the orchestration layer. And thus break the orchestration layer’s model.

The second point is less obvious. If two controllers need to operate on the same entity and that entity appears in two different databases, unless you are using transactional updates, each control system is working on a similar but not identical entity.

And so what happens is that inevitably both (1) and (2) require a human to figure things out. Which means the orchestration fails, and now someone has to debug it.
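A small sketch of point (1), assuming the orchestration layer and the control system each keep their own copy of an entity. The dictionaries and the `find_drift` helper are hypothetical. Note that the code can only detect the disagreement; deciding which copy is right is the part that ends up needing a human.

```python
orchestrator_model = {"vm-1": {"network": "blue"}}   # orchestration layer's database
controller_db = {"vm-1": {"network": "blue"}}        # control system's own database

# Someone reconfigures the control system directly, bypassing orchestration.
controller_db["vm-1"]["network"] = "red"


def find_drift(model: dict, actual: dict) -> dict:
    """Return entities whose configuration differs between the two databases."""
    return {
        name: (model.get(name), actual.get(name))
        for name in model.keys() | actual.keys()
        if model.get(name) != actual.get(name)
    }


drift = find_drift(orchestrator_model, controller_db)
```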

And maybe I stumbled on that, but what I completely missed is the database for the Hypervisor/OS and the server image.

See, the Hypervisor has a database: its local configuration and the workloads it was running. It also has an image.

And if the image and the configuration are not controlled by the control plane, then, by definition, the Hypervisor can also be out of sync.

So what do you do?

Well, you create a stateless hypervisor that relies on the control plane to tell it what to do.

Which brings me back to non-Nutanix systems like OpenShift, VCF, and Proxmox. None of them controls the entire system. They control parts of the system because the Hypervisor has its own database—the configuration files and the list of processes it starts up on reboot.

So I wrote that statement about databases, and Vytautas Jankevičius was right to say I was wrong. The way I wrote it isn’t accurate. For example, OpenShift has a single database that pushes configuration down. What I meant was something a lot more abstract. What I meant was that there was a persistent universe where, depending on who you asked, you got a different answer. For example, with OpenShift, if the machine doesn’t reboot, the VM will keep running. During that time, depending on who you ask, you will get different answers. And it’s that “different answers until human intervention” that creates a split-brain universe. And to resolve that inconsistency, you need a third system, or an orchestration, and a human. In my mind, if I can get two answers, and the only way to fix it is to have a human intervene, then there exist two databases that are sometimes in sync.

My point was too abstract, and I phrased it — well, wrong. In my defense, I was trying to squeeze this into the limited text window of LinkedIn.com …

Nutanix took a very radical approach: What if there was precisely one database? If there is exactly one database, every control system has the same view of the entity. And therefore every control system sees the same updates in the same order. Nutanix maintains that property by having the hypervisor shut down if it can’t reach the control plane. In other words, there is no arbitrary period during which the hypervisor is running and the control plane can’t control it.
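To make the "same updates in the same order" property concrete, here is a minimal sketch that models the one database as a single ordered log which every controller replays. This is an assumption made for illustration, not a description of how Nutanix stores state; `SingleDatabase` and `Controller` are hypothetical names.

```python
class SingleDatabase:
    """Hypothetical single source of truth: one ordered log of updates."""

    def __init__(self):
        self.log = []   # list of (sequence, entity, config)

    def commit(self, entity: str, config: dict) -> int:
        seq = len(self.log) + 1
        self.log.append((seq, entity, dict(config)))
        return seq


class Controller:
    def __init__(self, name: str):
        self.name = name
        self.applied = 0     # last sequence number this controller has applied
        self.view = {}

    def catch_up(self, db: SingleDatabase) -> None:
        # Apply every committed update exactly once, in commit order.
        for seq, entity, config in db.log[self.applied:]:
            self.view[entity] = config
            self.applied = seq


db = SingleDatabase()
db.commit("vm-1", {"network": "blue"})
db.commit("vm-1", {"network": "red"})

compute, network = Controller("compute"), Controller("network")
compute.catch_up(db)
network.catch_up(db)
```

Because both controllers replay the same log, they cannot end up holding similar-but-not-identical copies of `vm-1`.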

Obviously, the next challenge is scaling that system. And for that, I recommend The Nutanix Bible.

Filed Under: nutanixist

End of vSphere Standard: A Personal Tribute

October 30, 2025 by kostadis roussos

The end of vSphere Standard on July 31, 2025, somehow passed me by.

Which is sad, because its existence was such a massive part of my life.

When I became the architect of vCenter, I saw it as an opportunity to make an impact on the world. Customers trusted VMware. And the world trusted VMware. So much of the world depended on vCenter and vSphere that I described the job as the most important job in IT.

I felt like I had been handed the keys to the kingdom and told, “Go figure it out.”


And -by-god- we did.

And the proof is 10 years later, when so many customers are mourning the end of vSphere Standard.

What was vSphere Standard? It wasn’t a bag of bits. Anyone can build a bag of bits. It was the commitment of the finest engineering team on the planet to take care of customers at all costs. At a cost even to VMware’s business.

VMware was relevant because of those 300,000 customers. It was those customers who made us irreplaceable. It was those customers that made us significant. It was those customers who allowed us to shape the direction of IT. Because we had that reach, we mattered.

Or at least I thought we did.

The number of things that we did to guarantee that vSphere Standard customers had a great experience at the expense of other customers was large.

I saw that trust as an obligation.

Folks would walk into my office and say, “Do X.” And I remember thinking, out loud and silently, that keeping our customers happy and never giving them a reason to leave was my first and only job.

And it wasn’t just me, but the entire organization. It was devoted to that customer base.

I feel a sense of loss to see the end of that relationship with those customers.

The customers aren’t small. They are real, big businesses. They are businesses that relied on my team to do right by them. The idea that they are small is unfair to those companies that trusted us.

They include the guy who hugged us when we delivered the Supervisor, because we had ensured he could keep his job and keep supporting his family.

When I see what happened, I feel a level of regret that maybe I shouldn’t have done what I did. Perhaps it was a mistake.

It wasn’t. It just shows you that change is inevitable.

And then I take it differently.

The outpouring of frustration from the customer base means that my team did right by you.

My mission was to deliver stellar value and to earn your trust in my team. On that, I can declare: Mission Accomplished.

And I wish it wasn’t in your frustration that I found out.

Thank you for being great customers and trusting us for all those years.

And I do work at Nutanix 🙂 If you loved my work at VMware, you might find Nutanix worth checking out

Filed Under: Software

nutanixist 24: how nutanix made k8s persistent volume provisioning more reliable and available

October 29, 2025 by kostadis roussos

One of the core impedance mismatches between k8s control planes and compute control planes is how disks are attached and the constraints thereon.

Why does it matter?

Whereas with VMs, adding a disk is a relatively rare day 2 operation, in a k8s environment, attaching a disk is part of restarting a pod that failed.

In a previous post, I wrote about how the hypervisor’s host control plane prevents adding a disk to a VM while the VM is being moved, and how that fundamentally affects the availability of the application that runs in Kubernetes.

Now I want to talk about another challenge.

To create a virtual disk via the CSI, you must interact with the infrastructure control plane.

Now the performance, availability, and location of the infrastructure control plane matter.

With Nutanix, you can configure the CSI system to communicate directly with Prism Element (PE). When you do that, our CSI provider provisions a virtual disk by interacting with the PE control plane on the host where the kubelet’s VM runs. What’s important is that if the VM is running, then the PE control plane is accessible because an endpoint exists on the same physical host.

If you do not use the Nutanix CSI in PE mode, the CSI provider must communicate with Prism Central (PC). This can lead to issues where the kubelet is unable to provision a disk because it depends on an external system.

The VCF 9.0 product documentation includes an excellent illustration of this architecture.

This leads to an availability mismatch, which adds complexity. The external control plane must be more available than any host that creates a pod. The network must be designed to support that level of availability. While this is achievable, it introduces additional tradeoffs.
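A back-of-the-envelope way to see the mismatch: if provisioning a disk needs the host, the network, and an external control plane to all be up at once, the availability of that path is roughly the product of the individual availabilities. The figures below are invented purely for illustration, and the calculation assumes the failures are independent.

```python
def series_availability(*components: float) -> float:
    """Availability of a path that needs every component up (independence assumed)."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


host = 0.999            # assumed value, illustrative only
external_cp = 0.999     # assumed value, illustrative only
network = 0.999         # assumed value, illustrative only

# Provisioning that depends only on an endpoint on the same physical host.
local_path = series_availability(host)
# Provisioning that must also reach an external control plane over the network.
external_path = series_availability(host, external_cp, network)
```

With these made-up numbers the external path is always somewhat less available than the host alone, which is why the external control plane and network have to be engineered to a higher bar than any single host.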

What I like about the Nutanix platform is the choice it offers. And depending on the tradeoffs that matter for you, you can make different choices.

Filed Under: nutanixist

nutanixist 23: overcoming kubernetes and vm storage limitations

October 29, 2025 by kostadis roussos

Virtualization offers workload isolation and separates infrastructure and application teams. This separation allows operations such as vMotion to proceed without coordination, enabling host maintenance and infrastructure rebalancing to proceed seamlessly.



One notable limitation of Kubernetes (k8s) and virtual machines (VMs) is the interplay between pod deployment and persistent volumes. Platform engineers want the ability to deploy pods and create storage on demand quickly. However, the virtual machine abstraction complicates this, making pod deployment more challenging and negatively affecting application availability.

For instance, when a virtual machine has a single virtual disk and needs to attach another, this operation is blocked during mobility tasks with hypervisor-attached storage.

Now, in a traditional VM environment, this limitation isn’t too big a deal, because adding another virtual disk is a relatively rare day 2 operation.

But in k8s, whenever I deploy a new pod to a VM and want persistent disks, I have to add another virtual disk.

So now, whenever the infrastructure admin wants to perform a rebalancing or maintenance, they must coordinate with the platform engineering team or the application team.

The whole point of virtualization is to provide isolation, yet because of this behavior, you lose it.

So co-ordinate!

Except for one critical use case called “High-Availability,” where the VM is rebalanced both before and after a server failure. So when a server fails, if a pod fails, and your VMs are being rebalanced, your pod restart can hang for an indeterminate amount of time. And if it hangs, then your application either runs in a degraded mode or doesn’t run at all.

This limitation exists for all KVM-based hypervisors, to the extent I am aware, and for VMware hypervisor-attached storage.

Nutanix, however, offers another class of storage, called a “volume group,” that has been available for 5+ years and allows a guest to attach to a virtual disk via iSCSI.

Nutanix calls that a “guest attached” volume group.

There is a trade-off, of course, in using this iSCSI layer. The Nutanix CSI driver handles the details.

In a vSphere world, you could use iSCSI to an external storage array from the guest, which introduces another set of trade-offs. It also complicates the environment’s operations. vVols tried to make that better.

With Nutanix, the nice property of the volume group is that I can attach multiple virtual disks and apply data management policies to the volume group, such as snapshots and DR, so that as new disks are created, they inherit those policies.

And so I get the simplicity and flexibility of virtual disks, without any of the day-2 headaches of hypervisor-attached storage.
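The difference between the two attachment styles can be modeled in a few lines. This is a toy model of the behavior described in this post, not a real hypervisor or CSI API: `VM`, `attach_hypervisor_disk`, and `attach_guest_volume` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    migrating: bool = False
    hypervisor_disks: list = field(default_factory=list)
    guest_volumes: list = field(default_factory=list)


def attach_hypervisor_disk(vm: VM, disk: str) -> bool:
    # Modeled on the behavior described above: blocked while the VM is moving.
    if vm.migrating:
        return False
    vm.hypervisor_disks.append(disk)
    return True


def attach_guest_volume(vm: VM, volume: str) -> bool:
    # Guest-attached storage is a connection made from inside the guest,
    # so in this model it does not depend on the VM's mobility state.
    vm.guest_volumes.append(volume)
    return True


vm = VM(migrating=True)                                  # e.g. HA rebalancing in progress
blocked = attach_hypervisor_disk(vm, "pvc-disk-1")       # pod restart would have to wait
attached = attach_guest_volume(vm, "pvc-volume-group-1")
```

In the first case the pod restart is coupled to whatever the infrastructure team is doing; in the second it is not, which is the isolation virtualization was supposed to provide.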

Filed Under: nutanixist

nutanixist 22: the impact of system consistency in ahv vs esx and other systems

September 27, 2025 by kostadis roussos

AHV prioritizes system consistency over workloads, whereas ESX and every other OS prioritize workloads over system consistency.

If you examine the fundamental difference between AHV and ESX, once you set aside the features, APIs, and opinions, the most basic question is: “When is the host down?”

ESX asserts that as long as the kernel is running, the host remains up because a workload may be running or about to start. Even if the kernel is unreachable from the outside, ESX continues to run. The only person who can decide the host is down, therefore, is a human.

AHV, on the other hand, believes that once it is no longer part of the quorum, the host is down.
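The two answers to “When is the host down?” can be written as two small predicates. This is a sketch that assumes a simple strict-majority quorum rule. The function names and the `policy` values are hypothetical, and neither branch is meant to describe the actual ESX or AHV implementation.

```python
def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """True if this host plus the peers it can reach form a strict majority."""
    return (reachable_peers + 1) > cluster_size // 2


def host_is_up(kernel_running: bool, reachable_peers: int,
               cluster_size: int, policy: str) -> bool:
    if policy == "kernel":      # ESX-like rule as described above
        return kernel_running
    if policy == "quorum":      # AHV-like rule as described above
        return kernel_running and has_quorum(reachable_peers, cluster_size)
    raise ValueError(f"unknown policy: {policy}")


# A host in a 3-node cluster that is isolated from both peers:
isolated_kernel_view = host_is_up(True, 0, 3, "kernel")   # True
isolated_quorum_view = host_is_up(True, 0, 3, "quorum")   # False
```

Under the kernel rule, software on the isolated host cannot tell that it should stop; under the quorum rule, it can decide that by itself.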

Both approaches have value, but they yield different outcomes.

With ESX, the human has complete control over deciding when to restart a host. Because only the human knows whether the host is running, each additional component must continue functioning until instructed otherwise and must keep operating even if other system parts are down.

It’s why, for example, with vSphere HA, even if the network is partitioned, all hosts will run workloads.

Until the human indicates that ESX is down, all system components should assume a workload is either running on the ESX host or may start running, so they must try to keep running as well.

The difficulty is that a malfunctioning piece of software can appear just like one that is very slow.

Therefore, each layer advances without knowing if another will do the same later, which can result in incorrect decisions.
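
A toy timeout-based failure detector makes the ambiguity concrete. This is an illustrative sketch under simple assumptions; `suspect_failed`, `reply_delay_s`, and `TIMEOUT_S` are hypothetical names, not part of any product.

```python
from typing import Optional


def suspect_failed(reply_delay_s: Optional[float], timeout_s: float) -> bool:
    """Return True if the observer gives up waiting for a reply.

    reply_delay_s is None for a crashed peer that never replies.
    """
    return reply_delay_s is None or reply_delay_s > timeout_s


TIMEOUT_S = 5.0
crashed = suspect_failed(None, TIMEOUT_S)    # the peer really is dead
very_slow = suspect_failed(30.0, TIMEOUT_S)  # the peer is alive, just slow
print(crashed, very_slow)  # True True
```

From the observer’s side both cases return the same answer, so any layer that acts on it is guessing.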

A trivial way to prove this is with backup and restore. When you restore a system from a backup, to an outside observer it is indistinguishable from a very slow system. The stale system must now catch up with the current state of the world. To do that, it has to read the current state, but there is no single precise current state to read. So at some point, a human must get involved to resolve the inconsistencies. It’s why restoring VCF is so painful.

The benefit of this approach is that it allows surprisingly fast delivery of components, as long as integration and consistency are less critical than the speed of feature delivery for each one.

The downside is that when two systems need to agree on the system’s status, they cannot. Because only a human knows if the system is up, down, or slow, any software trying to coordinate between two components can only make an educated guess about what’s happening.

To handle this, you need to invest in more tools, monitoring, and observability. But it’s always a guess.

The alternative approach of AHV has the property that the software systems are aware when the host is down, since the computer makes the decision independently of whether workloads are running on the AHV host.

More importantly, any workload on that AHV host will not run until the cluster control plane reinstalls it on the host.

As a result, any layered system knows exactly when to stop.

Consequently, all layered systems are aware of each other’s state.

And all parts of the system agree on the state of the workloads.

The upside of this approach is that the system is correct, scales better, and is simpler to operate and develop against. The downside is that until the quorum system is more reliable than a single kernel, your system is less reliable.

What AHV has done is make its clustered system as reliable as a single kernel. And that is an astonishing achievement.

Once that is achieved, and if overall system behavior is more important than any single system, then the simplicity of the AHV approach allows for faster feature delivery because integration itself is simple.


Filed Under: ahv, nutanixist

the nutanixist 20: how to build an AZ using soft transactions, a clustered IO path, and a stateless hypervisor without a hyperscaler cloud network

September 21, 2025 by kostadis roussos

I’ve been pondering the problem of making infrastructure transactional for 20 years.  

The one paper I wrote (https://www.usenix.org/legacy/event/lisa07/tech/full_papers/holl/holl.pdf) is an early attempt at getting desired-state systems to work.

You can read the paper, but the critical idea (and it’s an ancient one) was that you take all of the control plane code and put it in the central system. 

The problem with that approach (and why the product failed) is availability.

The thing we built had the nice property of simplicity of management. It had the unfortunate property of being less available than what it tried to replace. What do I mean? Our solution required a single centralized control plane. If that control plane failed, then snapshots, mirrors, and backups failed. Without our control plane, each NetApp Filer managed its own schedule and failed independently.
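
The availability penalty is simple arithmetic. The sketch below assumes independent failures and uses made-up availability numbers (`FILER`, `CONTROL_PLANE`) purely for illustration; they are not measurements of any real system.

```python
def series_availability(*parts: float) -> float:
    """Availability of a function that needs every part up at the same time,
    assuming the parts fail independently."""
    result = 1.0
    for availability in parts:
        result *= availability
    return result


FILER = 0.999         # hypothetical availability of a single filer
CONTROL_PLANE = 0.99  # hypothetical availability of the central control plane

# A filer that runs its own snapshot schedule depends only on itself.
standalone = series_availability(FILER)
# With a central control plane, snapshots need both to be up.
centralized = series_availability(FILER, CONTROL_PLANE)

print(f"standalone:  {standalone:.5f}")
print(f"centralized: {centralized:.5f}")
```

Whatever the numbers, the product can never exceed the filer’s own availability, and a control plane outage now stops data protection on every filer at once instead of on one.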

Storage administrators barfed all over it. They rejected the product and the architecture.

Then I went to Zynga. And there I took another stab at the problem of managing systems at scale. And there we built some pretty slick management software that allowed Zynga to scale to 100 million MAU for Cityville, on what was basically the flakiest infrastructure I have ever used. The critical insight I had at Zynga was that since transactional systems at scale didn’t work with a centralized database, you needed to build something that relied on eventual consistency.

Then I came to VMware and decided to tackle the problem of deterministic infrastructure at scale again. That’s when I realized there wasn’t really a solution to my problem. 

What was my problem?

I had several hundred distributed databases (one per cluster), and I wanted to manage particular semantics that didn’t quite fit into a cluster’s semantics. For example, networking spans clusters. 

And I failed to come up with an answer. 

What do I mean? The existing system required manual intervention to keep running. The new eventually consistent system also required manual intervention to keep running because it wasn’t deterministic.

So what was the win? Unclear. But there was a win around per-cluster state, and so we decided to solve that. Working with Brian Oki, who did most of the heavy lifting, we devised a plan to make forward progress. We decided to push the cluster state into the cluster.

We began working on an internal project called Bauhaus, despite not having a definitive answer on how to approach networking. Bauhaus was about moving some of the cluster state into the cluster using a distributed KV store to simplify recovery and improve resiliency. 

The critical insight I didn’t have was “AZ” 

An AZ is one of those concepts that practitioners of distributed systems have spectacularly failed to define, and it is the most fluid of all.

Ask 50 practitioners and you get 50 answers. 

And because of that, it’s too amorphous to build systems with. 

But there is a crucial insight about an AZ: 

An AZ is a control plane whose failure makes the hardware it manages unusable, even if that hardware is powered on.

An AZ from the outside observer’s perspective is one thing. 

But the critical activity in cloud engineering is “how do I build an AZ so it appears to be one thing, but is actually built from many things.” 

The thing that’s not obvious to folks who don’t spend much time puzzling over this problem is how the network is built in the cloud.

If you examine the cloud, the critical aspect of their systems is a highly redundant, high-bandwidth inter- and intra-data-center network.

Every cloud has its own proprietary networking stack, which, when you interact with it (from the underlay, not the overlay), requires a significant amount of bridging magic. Those underlay networks do not have all of the semantics or properties of traditional IP networks.

It’s the existence of those networks that allows for the cloud to provide a transactional system behavior. 

So let me be precise: 

In the cloud, I can assert that if I can’t reach a node, the node is down. 

If I can’t reach the AZ, it’s down. 

And if a VM was created in AZ 1, it’s either running in AZ 1 or not running in AZ 1. It cannot exist outside of AZ 1.

Without the cloud networks, and without every part of the system being engineered around this principle, building an AZ-like construct on premises was very difficult unless you invested extensively in network and hardware design.

What these Nutanix guys did is figure out how to work around this using a custom data path and soft transactions. 

Rather than relying on the network connectivity to determine if a VM is running or not, they used the IO data path and a stateless OS. 

The IO data path guarantees that any hypervisor that boots cannot access any state that the clustered control plane doesn’t want it to access. 

The stateless OS allows the cluster control plane to program the OS to its new state trivially. 

The existence of a clustered IO path and a stateless hypervisor allows the cluster to control what state is being modified and which workloads are running. In effect, the clustered I/O path and stateless hypervisor enable the cluster as a whole to operate as a single entity.
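
The general idea can be sketched with a fencing-token pattern. To be clear, this is not Nutanix’s implementation, which the post does not describe; `ClusterControlPlane`, `admit`, `evict`, `authorize_io`, and the epoch counter are all hypothetical, a textbook technique swapped in to show how an IO path plus a stateless host lets the cluster decide what runs.

```python
class ClusterControlPlane:
    """Hypothetical stand-in for a clustered control plane."""

    def __init__(self) -> None:
        self._epoch: dict[str, int] = {}          # last epoch issued per host
        self._active: set[str] = set()            # hosts currently admitted
        self._desired: dict[str, list[str]] = {}  # workloads assigned per host

    def admit(self, host: str, workloads: list[str]) -> int:
        """Admit a freshly booted host: issue a new epoch and program its state."""
        self._epoch[host] = self._epoch.get(host, 0) + 1
        self._active.add(host)
        self._desired[host] = list(workloads)
        return self._epoch[host]

    def evict(self, host: str) -> None:
        """The host left the quorum: revoke IO access and clear its assignments."""
        self._active.discard(host)
        self._desired[host] = []

    def authorize_io(self, host: str, epoch: int) -> bool:
        """The IO path serves a request only from an admitted host with the current epoch."""
        return host in self._active and self._epoch.get(host) == epoch

    def workloads_for(self, host: str) -> list[str]:
        """A stateless host runs exactly what the cluster assigns, nothing else."""
        return list(self._desired.get(host, []))


cp = ClusterControlPlane()
token = cp.admit("host-a", ["vm-1"])
before_evict = cp.authorize_io("host-a", token)    # admitted: IO allowed

cp.evict("host-a")                                 # host drops out of quorum
after_evict = cp.authorize_io("host-a", token)     # stale host: IO refused

new_token = cp.admit("host-a", [])                 # host reboots and rejoins empty
stale_after_rejoin = cp.authorize_io("host-a", token)
print(before_evict, after_evict, stale_after_rejoin, cp.workloads_for("host-a"))
```

In this sketch, reachability never enters the decision. A host that the cluster has evicted cannot touch state even if it keeps running, and when it comes back it holds nothing until the control plane assigns it something.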

As I mentioned earlier, soft transactions and a distributed database are what enable this scalability.

In this incredibly long and complex journey, I was fortunate to work with some brilliant people, but a critical person was Dahlia Malkhi, who, when I hit a brick wall, made it possible for me to see the path around it. I call her out because she was a researcher, and we may have interacted on a technical topic 2 or 3 times, and each time was seminal.


Filed Under: nutanixist
