the nutanixist 28: cold migration as nutanix dr

I’ve been digging into the details of Nutanix DR, and as I’ve done so, I have begun to appreciate the staggering coolness of what was built.

In all infrastructure DR systems I am familiar with, the guarantee is that storage is replicated.

The DR process typically involves stopping replication, booting the servers, ensuring the OS runs correctly, and then starting the applications.

The challenge is that the servers’ source and destination configurations differ.

So what’s the big deal?

The big deal is that each server configuration is like a little database, and each database has to be updated.

The network configuration in Site A is different from the network at the DR site. So a lot of orchestration and energy was expended to make sure that, for each server the applications failed over to, the network was configured to allow the application to fail over correctly.

What virtualization enabled was a solution to that problem, and when SRM first shipped, I felt like the heavens parted and the angels sang.

Each server could have its own configuration, but the server was mapped to a virtual object in vCenter.

So instead of having to change N different databases, you only had to change one.
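To make the "one database instead of N" point concrete, here is a minimal Python sketch. It is not SRM's or vCenter's API; the network names, the `NETWORK_MAP` table, and the `failover_networks` helper are all made up for illustration. The idea is simply that the site-to-site differences live in one shared mapping rather than inside each server.

```python
# Hypothetical illustration only: these names are invented, not a real API.
# One shared table records how Site A networks map to DR-site networks.
NETWORK_MAP = {
    "site-a-prod": "dr-prod",
    "site-a-mgmt": "dr-mgmt",
}


def failover_networks(inventory, network_map=NETWORK_MAP):
    """Return each VM's DR-site networks, derived from the single shared map."""
    return {
        vm: [network_map[net] for net in nets]
        for vm, nets in inventory.items()
    }


# Two VMs, neither of which carries any DR-specific configuration itself.
inventory = {
    "web01": ["site-a-prod"],
    "db01": ["site-a-prod", "site-a-mgmt"],
}

dr_view = failover_networks(inventory)
print(dr_view)
```

Adding a hundred more VMs changes nothing in the mapping; only a new network at either site does.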

There was, of course, a gap. The gap was that storage replication, whether array-based or host-based, didn’t replicate all of the virtual machine’s state. ESX has the vmx configuration and the vmdk state, but it doesn’t contain the vCenter state.

To replicate the vCenter state, SRM was created.

What SRM did was to take a stream of notifications from vCenter and use that to create a new VM on the target vCenter. That new VM, at least to the best of my knowledge, had a different MOID than the source VM.

This added some complexity, but it also preserved the semantics of traditional DR, in that the remote server was a different server.

As a result, when a failover occurred, you ended up with a set of new VMs that your tooling had to account for. And there were ways of fixing this, so it wasn’t too bad.
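Here is a rough Python sketch of what "your tooling had to account for" new VMs looks like in practice. Everything in it is an assumption for illustration: the `vm-101` style identifiers, the `recreate_on_target` function, and the `retarget` helper are hypothetical and do not model SRM's actual behavior, only the general shape of the problem when the recovered VM gets a new identifier.

```python
import uuid


def recreate_on_target(source_ids):
    """Model a failover that creates a new VM object, with a fresh ID,
    for every source VM. Returns a map of old ID -> new ID."""
    return {src_id: f"vm-{uuid.uuid4().hex[:8]}" for src_id in source_ids}


def retarget(tool_refs, id_map):
    """Rewrite references held by external tooling (monitoring, backup,
    CMDB and so on) so they point at the recovered VMs."""
    return {name: id_map[vm_id] for name, vm_id in tool_refs.items()}


# Hypothetical references that external tools hold to the source VMs.
tool_refs = {"backup-job-web": "vm-101", "monitor-db": "vm-102"}

id_map = recreate_on_target(tool_refs.values())
fixed_refs = retarget(tool_refs, id_map)
print(fixed_refs)
```

The remapping step is the extra work: every tool that keyed on the old identifier needs the equivalent of `retarget` after a failover.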

At the core, the issue was that vCenter at the time had no native mechanism to replicate state between two vCenters.

Nutanix, on the other hand, took a different approach. They decided to replicate both the database state and the storage state.

On the DR site, they would then create the VM from the replicated VM state and run a recovery plan. What’s interesting is that the recovery plan would patch the differences, especially around networking, while keeping the VM’s identity consistent.

What was kept different between the two systems wasn’t the VM state and configuration, but the run book.

This meant that when they started the remote VM, it had the same identity as the source VM.
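A small Python sketch of that idea, as I understand it: the VM's replicated state carries its identity across unchanged, and the recovery plan only patches what differs between sites. The `VmSpec` and `RecoveryPlan` classes, the field names, and the network names are all hypothetical; this is not the Prism Central API, just a model of "same identity, patched networking."

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VmSpec:
    """Hypothetical replicated VM state: identity plus configuration."""
    uuid: str
    name: str
    networks: tuple


@dataclass
class RecoveryPlan:
    """Hypothetical run book: the only place site differences are recorded."""
    network_map: dict


def recover(spec, plan):
    """Build the DR-site VM from replicated state, patching only networks.
    Networks with no entry in the plan are left as they were."""
    patched = tuple(plan.network_map.get(net, net) for net in spec.networks)
    return replace(spec, networks=patched)


source_vm = VmSpec(uuid="3f2a9c1e", name="web01", networks=("site-a-prod",))
plan = RecoveryPlan(network_map={"site-a-prod": "dr-prod"})

recovered_vm = recover(source_vm, plan)
print(recovered_vm)
```

Because `recover` never touches `uuid`, anything that keyed on the VM's identity at the source still resolves at the DR site.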

In short, they had implemented bulk cold migration across Prism Centrals.

Now, vSphere cold migration has some limitations due to how the database works. You can do cold migration and preserve the identities within the scope of a single vCenter. But if the VM moves to another vCenter, the identity, as described above, changes.

What this means is that, on Nutanix, bulk workload migration is as reliable as DR, and no different than DR.

Pretty slick.
