the nutanixist 12: the deeply misunderstood SPOF and availability

I recently had a conversation with a friend of mine about a company he worked for years ago that had no backups for its production environment.

The team argued that they didn’t need backups because they had synchronous mirroring.

My friend pulled his hair out and asked, “So if a table gets corrupted, what then?”

What I find odd about the IT industry is that we think of a single point of failure (SPOF) purely as a hardware failure.

With modern systems, the total outage after a hardware failure lasts only minutes. Although that’s unfortunate, it’s not disastrous.

In contrast, recovering from a backup can take days.

Furthermore, recovering from a backup can result in data loss.

Given the cost, complexity, and downtime associated with restoring from backup, you would expect people to consider backups the most crucial part of their availability strategy.

But they don’t.

I would say, “But this is a SPOF,” and folks would respond, “Well, that is 5 minutes of downtime, not a big deal.”

And I realized that what I was really saying was, “Since this cannot be restored from backup, we are one human mistake away from destroying all of this infrastructure and taking hours or days to recover.”
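To make the gap between those two views concrete, here is a rough back-of-the-envelope sketch in Python. Only the five-minute failover figure comes from the conversation above; the event frequencies and the two-day restore time are made-up, illustrative assumptions, not measurements from any real environment.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60


@dataclass
class FailureMode:
    name: str
    events_per_year: float    # hypothetical frequency, for illustration only
    minutes_per_event: float  # downtime caused by one occurrence

    def expected_downtime(self) -> float:
        """Expected minutes of downtime per year for this failure mode."""
        return self.events_per_year * self.minutes_per_event


def availability(modes) -> float:
    """Fraction of the year the system is up, given all failure modes."""
    lost = sum(mode.expected_downtime() for mode in modes)
    return 1 - lost / MINUTES_PER_YEAR


# Assumption: two hardware failures a year, each costing 5 minutes of failover.
hardware = FailureMode("hardware failover", events_per_year=2, minutes_per_event=5)

# Assumption: one destructive human mistake every four years,
# each costing two days of restore time.
human_error = FailureMode(
    "restore after human error", events_per_year=0.25, minutes_per_event=2 * 24 * 60
)

for mode in (hardware, human_error):
    print(f"{mode.name}: {mode.expected_downtime():.0f} expected minutes per year")

print(f"availability: {availability([hardware, human_error]):.5f}")
```

Under these assumed numbers, the rare restore dominates the expected downtime even though it happens far less often than a hardware failure. Different inputs will shift the ratio, but the restore time is the term that matters.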

When I became the architect of VCF, I saw a system that could not be restored from backup without intensive help from VMware customer support. A typical VCF instance comprises two NSX deployments with their respective databases, an SDDC manager with its database, and several vCenters, each with its internal database. Even worse, on closer examination, each of these products contains multiple internal databases and configuration files. Products like the supervisor, Operations, and Automation further depend on all these systems having the same view of the state to work correctly.

While I was the architect of vCenter, I ensured that backups worked effectively. I reviewed and examined the file-based restore feature. Every feature had to explain what would happen after a restore. I led the effort to make MOIDs stable so that, after a restore, VMs retain their original IDs. I pushed as hard as possible to get a DKVS (distributed key-value store) so that a restore would work without breaking clusters.

What VCF has can be made to work. There is a prodigious amount of research in distributed systems that would allow this to work. However, simply reading the documentation, asking Google, and consulting Reddit will reveal how fragile the current system is.

What astonishes me about Nutanix is that Prism Central is routinely restored from backup by customers. Because it’s so easy to restore from backup, customers will restore from backup rather than try to figure out what broke.

Does that mean Prism Central always works? No. However, there is a qualitative difference between something that cannot work without painful intervention and something that mostly works.

Why does Prism Central work? Because the team did the hard systems work that takes years to implement.

It is possible to architect a system that can recover from a backup. It is possible to build a system that minimizes data loss. The DKVS was part of such a system.

It’s just hard and takes time.

What astonished me is that Prism Central has such a system.
