Over the last 15 years, I’ve been noodling about portability. And the more I noodle about portability, the more I think it’s very poorly defined.
So, let me try.
In the context of computers, there are two flavors of portability: data path portability and control path portability. The data path is the bits of code that do real work. The control path is the code that sets up the data path, monitors it, and takes corrective action.
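Here's a minimal sketch of that split, in Python, with names I made up for illustration: the data path does the work on the bits, and the control path decides when the data path runs and what to do when it misbehaves.

```python
import time

def data_path(chunk: bytes) -> bytes:
    """Data path: the code that does real work on the bits."""
    return chunk[::-1]  # stand-in for compression, encryption, routing, ...

def control_path(source, sink, max_retries: int = 3) -> None:
    """Control path: sets up the data path, monitors it, takes corrective action."""
    for chunk in source:
        for attempt in range(max_retries):
            try:
                sink.append(data_path(chunk))    # run the data path
                break
            except Exception:
                time.sleep(0.1 * (attempt + 1))  # corrective action: back off, retry

out: list[bytes] = []
control_path([b"hello", b"world"], out)
print(out)  # [b'olleh', b'dlrow']
```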
In that weird way that computers work, the control path itself is just another data path. And so – in a bizarre way – there is only data path portability. Simplistically, the control path is just instructions that execute on processors and store data in memory and storage.
And although Paris Kanellakis would be pleased to know I learned something in CS051, things are not quite that simple in the world of commerce that I live in.
Except when they are.
The VM encapsulates a traditional single-VM application's control and data path. And the VM, as a result, is a highly portable abstraction, because the hypervisor can ignore which instructions are from the control path and which are from the data path; to it, they are all data path instructions. And we know that if you don't care about performance, any data path can simulate any other data path.
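If the "simulate any other data path" bit sounds abstract, here's a toy illustration – the two-opcode machine and its program are entirely invented – of one data path running another. The Python interpreter plays the hypervisor here: every guest instruction is just data to it, and nobody said anything about speed.

```python
# A made-up guest machine with two opcodes: ("add", n) and ("neg",).
def run(program: list[tuple], acc: int = 0) -> int:
    for op, *args in program:
        if op == "add":
            acc += args[0]
        elif op == "neg":
            acc = -acc
        else:
            raise ValueError(f"unknown opcode: {op}")
    return acc

print(run([("add", 5), ("neg",), ("add", 12)]))  # -> 7
```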
So far, so good.
VMware demonstrated that the portability of a VM is very valuable because you can move it between generations of hardware, move it around to use your infrastructure more efficiently, and because it's a convenient way to manage workloads.
And so, the VM abstraction and its portability had another interesting effect: you can hand over the operation of the infrastructure to someone else.
But while I was at NetApp and Zynga, people discussed all sorts of other kinds of portability. Data path portability was nice, but it wasn't sufficient if you couldn't encapsulate the entire app into a single VM.
So sure, I could move the VM around, but if I didn’t have the control path, what value was the VM?
And so there was this thought process that said, "VMs are not that interesting."
And I looked into all sorts of interesting programming languages and toolchains.
And yet, along the way, something funny happened. If I cared about performance, I cared a lot about the hardware I was running on. And all of a sudden, the behavior of the hypervisor mattered a lot.
At Zynga, when Amazon changed their NIC, our costs skyrocketed. Their decision to optimize their business almost tanked ours.
But performance – who cares about performance? Right? Except here's the rub: when you don't have pricing power for a feature, and you sell the hardware that the feature runs on, the only way the feature can be worth implementing is to implement it without meaningfully increasing the hardware costs.
And then the performance of the feature matters a lot. And some features don’t get implemented because there isn’t an efficient way to do that.
But if you can’t charge more money, why implement the feature? Because – here’s the rub – your competition is adding features very fast. And if you don’t stay ahead, the whole SaaS model suggests that eventually, your customers will move. So as a vendor, you have to keep adding features while keeping the cost of the features in check.
Hoom. So – wait, if I want to grow my business and add more services, and can’t meaningfully increase the price, then, erm, the performance of the feature matters a lot.
And yes, you can make code more efficient and thoughtful, but at some point, you start caring about whether the disk is an SSD or whether the NIC has enough buffers, or what exactly is the CPU you are running on.
I mean, if people didn’t care about these things, those things wouldn’t exist.
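And these details are not hidden very deep. On a Linux box, a few lines of Python will tell you exactly the kind of thing I mean; the interfaces below are standard Linux ones, though the device name is just an example and your machine may differ.

```python
from pathlib import Path

def is_rotational(device: str = "sda") -> bool | None:
    """True if the block device reports spinning media, False for SSD/NVMe."""
    p = Path(f"/sys/block/{device}/queue/rotational")
    return p.read_text().strip() == "1" if p.exists() else None

def cpu_model() -> str | None:
    """First 'model name' entry from /proc/cpuinfo, if present."""
    try:
        for line in Path("/proc/cpuinfo").read_text().splitlines():
            if line.startswith("model name"):
                return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return None

print("sda rotational:", is_rotational("sda"))
print("cpu:", cpu_model())
```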
Okay, so?
Well, once you start getting to the point that you care about performance that much, then the hypervisor details matter a lot. But surely you jest? Hypervisors make choices, and those choices impact the way your applications run. And if you design an application to run on a particular hypervisor, then the portability of your application is the portability of that hypervisor. If the hypervisor only runs on hardware X, Y, or Z, any performance improvement locks you into that vendor.
To put it differently, if the control path of the hardware is in the hypervisor, and you need to control the hardware, you need to control the hypervisor.
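To make "control the hypervisor" concrete: it means having a say over knobs like the ones below. The knob names here are invented for illustration, not any real hypervisor's API, but the underlying ideas (vCPU pinning, hugepage backing, NIC queue counts, disk cache modes) are the usual suspects.

```python
# Invented knob names; the point is the shape of the control, not the API.
desired = {
    "vcpu_pinning": {0: 2, 1: 3},  # guest vCPU -> host core
    "hugepages": True,             # back guest memory with huge pages
    "nic_queues": 8,               # multi-queue NIC for the guest
    "disk_cache_mode": "none",     # bypass the host page cache
}

def unmet(actual: dict) -> list[str]:
    """Every knob the platform won't let you set the way you want."""
    return [k for k, v in desired.items() if actual.get(k) != v]

# On your own hypervisor, `actual` is whatever you configured.
# On a public cloud, `actual` is whatever the vendor decided for you.
print(unmet({"hugepages": True, "nic_queues": 2}))
```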
So what? It turns out that hardware vendors are adding new hardware faster than that integrated solution vendor is exposing it. And so, having some degree of portability to run on other hardware will matter a lot.
Because why? Because the more it costs to run a feature, the less money you have and the more money the other person has.
Hoom.
Baremetal! BAREMETAL!
I hear you say. Let me be cheeky. See, every modern OS is a hypervisor of some kind. And when you run on a public cloud, you can't control the hypervisor.
And in some ways, not to be annoying, running your app in a VM on a cloud is like trying to tune your Java code – you can do some optimizations, but at some point you reach for C or Rust or C++.
So when I hear bare metal, I don't hear "no hypervisor"; I hear "run a hypervisor of your choice on the hardware."
BUT BAREMETAL IS HARD. I agree.
So?
Well, here's where things get very interesting. Suppose a portable hypervisor existed, you didn't have to do the hard part of managing the hardware, and you could still control said hypervisor to your heart's content?
Why then, for those applications where the data path was critical, you wouldn’t be tied to the cloud vendor’s hardware choices.
Data path portability is about being able to run your VM anywhere you want while still controlling the hypervisor.
AHA!
So for the essential thing that can move my costs, I need to move my VM to whatever hardware I want and tweak the hypervisor any way I want.
But then why has the public cloud won so much business?
Well, for the same reason Java won over C++.
It’s just got a better control plane for application development. Java makes so many things easier.
And really, that portable hypervisor was painful to use compared to the cloud.
So hypervisor portability was nice, but it was uninteresting if you couldn't run the control plane. And you couldn't, because it took a lot of time and money to build, and the value of doing that work was marginal. Except when it wasn't, like for Dropbox. Or at Zynga, where we designed our control plane to be portable and had options when Amazon decided to optimize their business at our expense.
But the control planes aren’t tied to the hardware platforms as much.
What does that mean?
Well, it means that for some workloads where the costs matter, using the non-proprietary control planes so I can use the portable hypervisor may be a better choice.
And well, a portable control plane is – to a hypervisor – just another data path.
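One way to read that: a portable control plane is just ordinary code written against an interface instead of against one vendor's API. A hypothetical sketch – the class and method names are mine, not any real control plane's:

```python
from abc import ABC, abstractmethod

class Platform(ABC):
    """The handful of operations the control plane actually needs."""
    @abstractmethod
    def create_vm(self, image: str, cores: int) -> str: ...
    @abstractmethod
    def destroy_vm(self, vm_id: str) -> None: ...

class PublicCloud(Platform):
    def create_vm(self, image: str, cores: int) -> str:
        return f"cloud-{image}-{cores}"   # would call the vendor's API here
    def destroy_vm(self, vm_id: str) -> None:
        pass

class OwnHypervisor(Platform):
    def create_vm(self, image: str, cores: int) -> str:
        return f"own-{image}-{cores}"     # would drive the portable hypervisor here
    def destroy_vm(self, vm_id: str) -> None:
        pass

def scale_out(platform: Platform, image: str, n: int) -> list[str]:
    """Control-plane logic: plain code, portable across platforms."""
    return [platform.create_vm(image, cores=4) for _ in range(n)]

print(scale_out(OwnHypervisor(), "web", 3))
```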
REPATRIATION!
No.
Every software service that survives and increases in value gets optimized and improved over time. The original S3 relied on MySQL databases, for crying out loud, and Facebook used a single NetApp filer.
Thinking about this as "repatriation" is the wrong mental model. The right model is that critical services will get optimized, and at some point the optimizations will care about the hardware. That is where the existence of portable control planes and portable data planes gets fascinating.
But I have to build my data center!
Nope.
See, those portable hypervisors are pretty much available as a service now.
And that leads me to the final round of thinking.
When application control and data plane portability is possible, will the ability to cram a small number of hardware configurations in data centers be that valuable?