22 architecturalist papers: multi-tenancy and quotas

Over the last few years, I have gotten into a series of protracted debates about multi-tenancy.

What I have begun to understand is that it is essential to define the objectives of multi-tenancy before one starts to talk about it.

And even before we get to that, we need to define what multitenant means.

Consider a piece of hardware, say a server with four sockets. An individual owns the server. Another individual owns the building in which the server resides.

In effect, when two actors, Mary and Tom, have access to a system, that system is said to be multitenant if Mary and Tom do not trust each other.

But how much do they trust each other? The level of trust determines how much the system must protect Mary from Tom and vice versa. For example, suppose Mary trusts Tom completely. Then Mary doesn’t care that Tom has physical access to the hardware, and Mary takes no actions to protect her data or her applications running on that server. In effect, Mary and Tom are the same person; they just have different roles.

Now suppose Mary trusts Tom, and Tom doesn’t want to damage Mary’s system accidentally. This is where identities and roles become a factor. What Mary would like is a role that Tom can use that allows him to do the things he needs to do to Mary’s server and no more.

And so this is where things get complicated. There are two basic approaches. The first is to bake into the system the set of controls that Tom has access to, and to use a role-based access system, integrated with an identity system, that determines what Tom can do. The problem with such an approach is that if Tom needs to do something that is not in the system, he has no way to do it and has to ask Mary. If Mary is okay with that, all good. However, Mary may not want to do the task herself and may wish to allow Tom to do the job. If the system has no way for her to delegate just that task, then she is forced to give him access to more controls than he needs.
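To make the first approach concrete, here is a minimal sketch of a system with a baked-in operation set and role table. Every name in it (`ROLE_PERMISSIONS`, `Server`, the `owner` and `operator` roles, the operation names) is a hypothetical illustration, not something taken from a real product.

```python
from dataclasses import dataclass, field

# Assumption: the system ships with a fixed table of roles and operations.
# Nothing outside this table can be granted to anyone.
ROLE_PERMISSIONS = {
    "owner": {"create", "delete", "modify", "reboot"},
    "operator": {"modify", "reboot"},
}


@dataclass
class Server:
    """Hypothetical server whose access control is built into the system."""
    roles: dict = field(default_factory=dict)  # user name -> role name

    def allowed(self, user: str, operation: str) -> bool:
        # Unknown users and unknown roles get an empty permission set.
        role = self.roles.get(user)
        return operation in ROLE_PERMISSIONS.get(role, set())

    def perform(self, user: str, operation: str) -> str:
        if not self.allowed(user, operation):
            raise PermissionError(f"{user} may not {operation}")
        return f"{user} performed {operation}"
```

With `Server(roles={"mary": "owner", "tom": "operator"})`, Tom can `reboot` but not `delete`. If Tom needs an operation the designers never anticipated, say `snapshot`, no role grants it, and the only lever Mary has for anything between `operator` and `owner` is to promote Tom to `owner`, which hands him far more than the one task required.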

The second approach is to use layering. You create a net new interface that interacts with Mary’s system through some APIs, and that net new interface is what Tom uses. Thus, when Mary wants to enable Tom to do something new, Tom can extend his tool to do that. The problem with this approach is that Tom now has access to a whole bunch of operations he shouldn’t have. The only thing preventing Tom from using those operations is his adherence to procedure and the fact that, at the end of the day, Tom isn’t malicious. He’s a good guy.
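A rough sketch of the layered approach might look like the following. `ServerAPI` stands in for Mary’s system and `TomTool` for the new interface; both names, and the three operations, are assumptions made up for illustration.

```python
class ServerAPI:
    """Hypothetical broad API to Mary's system; it does no access control."""

    def __init__(self):
        self.objects = {}

    def create(self, name, value):
        self.objects[name] = value

    def modify(self, name, value):
        if name not in self.objects:
            raise KeyError(name)
        self.objects[name] = value

    def delete(self, name):
        del self.objects[name]


class TomTool:
    """Layer on top of ServerAPI exposing only the agreed operations."""

    def __init__(self, api, enabled):
        self._api = api              # the tool holds full access to the API
        self._enabled = set(enabled)

    def enable(self, operation):
        # Extending what Tom can do is a change to the tool's configuration,
        # not a change to the underlying system.
        self._enabled.add(operation)

    def run(self, operation, *args):
        if operation not in self._enabled:
            raise PermissionError(f"{operation} is not enabled in this tool")
        return getattr(self._api, operation)(*args)
```

The sketch also shows the weakness: the tool, and therefore Tom, holds a reference to the whole API. Nothing stops `tool._api.delete(...)` except convention, which is the code-level version of relying on Tom being a good guy.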

My observation is that approach one doesn’t work. The reason it doesn’t work is that the set of operations that Tom needs to perform is ever-evolving. Worse, the collection of activities that Mary wishes Tom to do is ever-expanding. As a result, they end up using the second approach.

Okay, so what?

The problem is that too many people attempt to build the first model. For example, suppose I have an interface for interacting with the system. That interface allows me to create, delete, or modify objects. Then somebody decides that the hierarchy of those objects should reflect some authorization scheme. The result is that Tom and Mary can’t do their jobs, because the authorization they need is either not expressible in that hierarchy or too complex to configure and set up. In effect, the hierarchy that allows you to create, edit, and manipulate objects for one task is not the same hierarchy you would use for another.

And so, ultimately, what you do is you create a tool that has a specific set of operations that Tom needs. Mary and Tom configure the tool so that it only does what it needs to do.

But the advocates of the first system point out that the second approach is less secure. And they are right, or at least I’ll take them at their word.

They ask, what if Coke and Pepsi want to run their software on the same physical servers? I always found that to be an absurd question. Even if we could assume that the system was entirely secure, there is human error. I thought that Coke and Pepsi would always buy their own servers. What is interesting is that the market seems to be doing that even in the public cloud. The Nitro hardware that Amazon has produced mainly provides physical instances on a shared server. And this was before we discovered that there are architectural holes in our systems that allow data to leak between programs running on the same physical server that belong to different tenants.

And so, my assumption has always been that if you care about security, air gaps are about the only thing you should trust.

What does this mean?

Consider the server. With no software, it can do anything you could imagine. The minute you start running software, the set of things you can do becomes increasingly limited. It turns out that there is a whole slew of user interfaces that are a lot more useful than just starting with the hardware. Over time, a set of interfaces for using a system and controlling access to that system has emerged. And we have figured out over many years how to make them address both Tom’s and Mary’s needs. A great example is the use of root and less privileged users on most operating systems.

A user interface for the system that is useful to both Mary and Tom is incredibly powerful. And therefore, whenever a new way of interacting with servers emerges, there is a temptation to try to figure out what the boundary between the two tenants should be. The reality is that such interfaces developed after years of hard work and operational experience. Therefore, in a new system, you are most likely to draw the boundary in the wrong place. And thus, in my mind, how you access a system should be independent of how you control access.

Okay, I’m dangerously close to talking about security, and when it comes to security, I know that I know nothing.

The problem is that even if you don’t care about security, another critical use case of multi-tenancy is to reduce the cost for the infrastructure provider, Indigo. What Indigo wants to be able to do is assign quotas to Mary and Jane and Tom. And what they want is to ensure that Mary, Jane, and Tom don’t ever exceed their quotas.

Amazon’s solution was to create an infinite supply of servers and bill Mary, Jane, and Tom for their usage.

The limitation of such a system is that you can only buy the set of servers that Amazon has decided to make available. The other limitation is that it assumes an infinite infrastructure.

If, however, Indigo does not have access to an infinite infrastructure or it’s inappropriate for their use case, what to do?

In my opinion, they should choose approach number two. What does this mean? There is a set of objects that Mary, Jane, and Tom use to do their jobs. Indigo has a set of quotas that they assign to Mary, Jane, and Tom. Mary, Tom, and Jane need to refer to the quotas and use them transparently. And so there is a temptation to encode them in the objects. But instead of having quota enforcement done at every access to the objects, it should be done lazily unless you exceed some threshold.
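One way to read “lazily unless you exceed some threshold” is sketched below. The quota ledger lives outside the objects; the request path only checks a generous hard threshold, and the nominal quota is checked later by a separate pass. The class name `LazyQuota`, the `hard_factor` of 1.5, and the method names are all assumptions chosen for illustration.

```python
class LazyQuota:
    """Hypothetical quota ledger kept separate from the objects it meters."""

    def __init__(self, limits, hard_factor=1.5):
        self.limits = dict(limits)              # tenant -> nominal quota
        self.usage = {t: 0 for t in self.limits}
        self.hard_factor = hard_factor          # assumed blocking threshold

    def allocate(self, tenant, amount):
        # Fast path: the only inline check is against the hard threshold,
        # so a tenant may drift past the nominal quota without being blocked.
        hard_cap = self.limits[tenant] * self.hard_factor
        if self.usage[tenant] + amount > hard_cap:
            raise RuntimeError(f"{tenant} blocked: hard threshold exceeded")
        self.usage[tenant] += amount

    def reconcile(self):
        # Lazy pass, run out of band (for example on a schedule): report
        # tenants over their nominal quota so the provider can follow up.
        return sorted(t for t, used in self.usage.items()
                      if used > self.limits[t])
```

With a nominal quota of 10 and the assumed factor of 1.5, a request that brings Mary to 12 units succeeds and only shows up when `reconcile()` runs; a further request that would bring her to 16 is refused on the spot.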

And if you look at what Amazon does, they do the same thing.

If you want to use one more server, they will give it to you. If you’re going to employ 10,000 more servers, that involves a phone call. They have their own quotas, which are lazily enforced and which, at some point, block access.

In effect, Amazon has decoupled secure isolation from quota enforcement.

And so, when we talk about multi-tenancy, what we need to do is ask, are we trying to solve for secure isolation, or are we trying to solve for quota enforcement? The requirements for security depend on the customer, the trust, the legal requirements, etc. How you do quotas is independent of all of those security restrictions and should be treated as such.
