
From 1996 to 2009, application availability was widely treated as an infrastructure problem, so improving reliability meant making the infrastructure itself more resilient. Tandem's NonStop systems symbolized the ideal of reliable infrastructure, and machines like the SGI Origin 2000, with its single system image and NUMA architecture, represented the peak of scalable computing. Yet in the mid-1990s, Gregory Pfister's book "In Search of Clusters" argued that building such infrastructure was too complex, advocating clustering instead.
At the time, when distributed systems were first being deployed, the idea seemed absurd.
As a result, infrastructure vendors continued to focus on making single systems more resilient.
When the cloud emerged, infrastructure architects like me viewed it skeptically because it offered no availability guarantees. "How will applications run on it?" we wondered.
What we didn't realize was that software gravitates toward cheaper hardware, and that new technologies would arise to make running on it easier.
For me, the pivotal moment came in 2009, when Cafeville was running on what was effectively a 1,000-node cluster. The team combined various components with a few critical innovations.
This marked the beginning of an era in which availability shifted from a purely infrastructure concern to an application problem, because the infrastructure itself had become less reliable.
My critique of vSphere and similar systems that are not natively clustered is that they are inherently less reliable than applications require. Consequently, application teams must write code that assumes infrastructure instability rather than depending on the system's reliability.
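What "assuming infrastructure instability" looks like in practice is defensive code scattered through the application: retries, backoff, timeouts. A minimal sketch (the function and the simulated flaky service are hypothetical, not from any particular Zynga codebase):

```python
import random
import time


def call_with_retries(op, attempts=5, base_delay=0.1, max_delay=2.0):
    """Retry a flaky operation with exponential backoff and jitter.

    This is the kind of defensive wrapper application teams end up
    writing when the infrastructure underneath them offers no
    availability guarantees.
    """
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # Out of attempts; surface the failure.
            # Exponential backoff with jitter to avoid thundering herds.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))


# Simulate infrastructure that fails twice before succeeding.
failures = {"left": 2}


def flaky_fetch():
    if failures["left"] > 0:
        failures["left"] -= 1
        raise ConnectionError("transient network error")
    return "ok"
```

Every team writing wrappers like this, independently, is exactly the maintenance cost described below.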
What do I mean by infrastructure instability?
In the pre-cloud era, infrastructure was assumed either to work or to fail outright. In the cloud era, uncertainty in the infrastructure became acceptable: as long as the application, its developers, and the operations team could identify what had gone wrong, it was tolerated.
The problem was that this increased the cost of maintaining and supporting applications and slowed down development, as teams spent more time on infrastructure issues than on the applications themselves.
At Zynga, when my team provided a reliable infrastructure, team sizes decreased, and productivity for the game teams increased.
Our team ensured there would be no ambiguity about how the infrastructure was performing. We provided guarantees.
When I say infrastructure needs to be more robust, I mean it must guarantee that the application and its components are running, that the data is available, that no unexpected infrastructure changes have occurred, and that the system can recover from a backup without being rebuilt.
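Those guarantees can be made concrete as a health check that passes only when every one of them holds. A minimal sketch, with all names and thresholds hypothetical (the config hash stands in for "no infrastructure changes", and the backup-age limit for "recoverable from backup"):

```python
import hashlib


def infra_healthy(service_up, data_readable, config_blob,
                  expected_config_sha, backup_age_hours,
                  max_backup_age_hours=24):
    """Return True only when every guarantee holds:

    - the application's services are up,
    - its data is readable,
    - the infrastructure config matches the known-good hash,
    - a sufficiently recent backup exists to recover from.
    """
    config_unchanged = (
        hashlib.sha256(config_blob).hexdigest() == expected_config_sha
    )
    return bool(service_up and data_readable and config_unchanged
                and backup_age_hours <= max_backup_age_hours)


# Usage: a drifted config fails the check even if everything is up.
known_good = b"cluster_size=3\n"
known_sha = hashlib.sha256(known_good).hexdigest()
```

The point of wiring checks like this into the infrastructure itself is that the answer is a guarantee, not a debugging session.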
And in an era of clustered applications, clustered infrastructure that provides those guarantees by default, like Nutanix, is the only way forward.
