05 architecturalist papers: the rules of rewrites

Over the last 20 years, I have been involved with or lead several rewrites of large systems. And that experience has taught me some basic rules of how to successfully do a rewrite. And they are pretty simple.

The business reasons have to make sense to every engineer and the business teams and everyone has to believe them.
The business leaders had to be committed to the rewrite. The bigger the scope of the rewrite, the more important the business leader. If you are rewriting the entire product, then the CEO has to be committed.
Once the plan was agreed upon, the entire team had to be working on the plan, and we had to deliver business value as soon as possible.
Optimize for the right strategic software architecture over the right operational or tactical software architecture.

The first rewrite I was peripherally involved in was a project at SGI called Cellular IRIX. Cellular IRIX was an attempt to build a multi-kernel single memory address space system to support SGI’s new cc-NUMA architecture.

The rewrite failed because there were too many constituencies who opposed the rewrite and too many business leaders who didn’t understand the value leaving too many engineers confused about why they should sign up for this.

When SGI imploded and some key engineering directors got fired, the project died a miserable death.

The second rewrite I was involved in as an individual contributor was the rewrite of the NetCache product and that was a successful rewrite.

The original version of NetCache was a port of software written by Internet Middleware Corporation. The system suffered from some flaws, flaws that were very hard to fix. If my memory serves me well, it was built on a callback-based system that had unclear and uncertain rules about how resources were being managed. There were if-statements littered throughout the code where function writers were expected to understand all paths that could result in a message with the state calling their function. In effect, to understand how to write a leaf function you needed to understand all possible states of the global system.

Although a technical mess, the reasons the rewrite succeeded have little to do with that. The technical mess and the near impossibility of fixing the technical mess were not why we successfully rewrote NetCache; it’s why we wanted to rewrite NetCache.

We were able to successfully rewrite NetCache because the performance and availability and feature velocity were a serious business problem and there were no alternatives on the table, and yet that wasn’t enough.

The real reasons are that

Everyone was working on the rewrite.
We knew that our new hardware could not be fully utilized using the old architecture
There were no constituencies in favor of fixing the old code base. The product managers didn’t want, the engineers didn’t want it and the customers didn’t want it.
Everyone from the GM down was committed to the rewrite
The code base caused business problems everyone understood.
The investment in the rewrite was 2x the total man years in the original code base.

After the NetCache rewrite, the next failed rewrite was of ONTAP following the Spinnaker acquisition.

NetApp decided to buy a great company with a great engineering organization, called Spinnaker. Spinnaker had a clustered file system and an embedded namespace virtualization and worked. Unfortunately, post-acquisition we failed to properly deliver an integrated product. Instead, we delivered a new version of ONTAP, ONTAP-GX that was not compatible with ONTAP. And then after we delivered that product, started another attempt to re-integrate the Spinnaker technology into ONTAP, an effort that became known as ONTAP 8.0

There were a lot of reasons why this effort failed, and I was in the periphery of the effort. And it’s tempting to point fingers at people I respect a lot. And so I won’t. Because tactical or operational considerations of the failure are of no interest to this blog. What is of interest, is what was the strategic software context that doomed this effort from the get-go?

At the time of the acquisition, the kind of performance and scale Spinnaker offered were of interest to a tiny part of the overall market. More importantly, the kind of virtualization Spinnaker offered – namely global namespaces was of less interest.

By 2004, the entire storage industry had decided that the FC and it’s evil step child iSCSI was the most important protocol on the planet because structured data was the growth business. And structured data lived in databases.

And NetApp’s business problem was to figure out how to address FC and iSCSI not how to insert namespace virtualization. And more importantly – at the time – there was little interest in core storage innovation. Dave Hitz, the EVP, and founder of NetApp, wrote a future history paper that said as much.

NetApp’s business problem was how do they sell more of what they had to more different customers, not a new storage array.

And every engineering manager, director, and engineer knew this. And so we had a constituency at NetApp develop that said – sure I’ll invest in this new system and at the same time, I will invest in the old system. I know about this alternative plan, because as an architect for the storage management team, I was 100% aligned to wait for the ONTAP rewrite to fail.

The original plan called for all of the engineering team to pivot to the rewrite. And then the pivot didn’t happen. And the pivot didn’t happen because when problems occurred with our core business, investing in a rewrite that wasn’t business critical became a luxury. And so the rewrite got starved of resources.

At the core, if I can use 20/20 hindsight, the mistake at a strategic software architecture perspective was the decision to have two teams, one working on the original product and one working on this rewrite. The core team felt that they were stuck doing sustaining and the rewrite team didn’t have enough resources to compete.

And the core manifestation of this separation was that the new ONTAP had a different build system and source code repository than the original ONTAP.

Much like Cellular IRIX, there were too many people looking outside in, and trying to find ways to cause it to fail.

The company more or less figured this out after ONTAP GX failed in the market. They realized that the only way to go forward was with one ONTAP that the entire company owned making it more cluster aware. And it didn’t hurt that by then; people felt that that the kind of storage system that ONTAP 8.0 was building was something that could be market leading.

The rest, as they say, is history.

The next rewrite I was involved in was the rewrite of Zynga’s backend, something I lead.

The business goal of the rewrite was quite clear. Zynga wanted to be in the 3rd party market and to do that we need to offer 3rd party games services. We had services, but the API for using them were PHP libraries, and that wouldn’t work. So we had to offer services that they could consume via APIs from their apps that didn’ t have to be in our data center.

And so we decided to take all of our systems and export them over a network API to the world. And this became Zynga’s 3rd party platform that lived a short life, but the API infrastructure turned out to be very valuable for something else, and that was mobile gaming. Later on what we built got heavily modified and re-architected, and I like to think that initial effort, a project called Darwin lead the way.

From my lessons of ONTAP and NetCache, I took three key lessons

There had to be a compelling business value that was non-technical.
The business leaders had to be committed to the project, not the technologists. There could be no way for someone to appeal to some exec higher in the chain to reverse course.
Once the plan was decided, we had to go very hard very fast with the entire team and deliver business value as soon as possible.

And that’s more or less what we did. I remember sitting in a room with Cadir, the CTO, and owner of all central engineering, where we made the decision to do this. And I told him I need 18 months, and he told me I had six. And then I remember telling him we needed everyone to work on this or not to bother at all, and he said – let do this. I continue to admire him for that level of commitment.

And to make sure everyone knew he was personally committed to this effort, Cadir lead an all hands where he personally said that this new effort to rearchitect the backend was something that he was committing the entire team to do.

We then had a series of architectural meetings where we figured out how to solve the fundamental problems of making our services available to the internet. At the time, we chose systems that were already in existence over writing new ones. Our goal was to have a working system in a quarter and something we could sell in six months.

Those meetings included every architect and constituency, and ultimately decisions were made that pissed people off. Some that were painful for me later on in my career, as I ran over people when I could have just listened and taken them along for the journey.

We did succeed in an environment where people were quitting post-IPO and then quitting because our business was suffering.

And by success, I don’t mean what we shipped was the right software, I mean we shipped the right software architecture. And the organization was reorganized around the idea of delivering services through APIs that were centrally managed and accounted for.

And that brings me to rule #4 – Optimize for architecture not for software implementation.

Because Cadir forced me to go fast, we couldn’t figure out the right thing to do in all cases. He forced me to prioritize what we needed to accomplish and what we needed to accomplish was get the APIs on the internet fast, not have the best pieces of software to do that.

The point he taught me and made clear is that we can always improve a shipping product, we can’t improve a project that got canceled. And that if the architecture is put together right, then the bits can be enhanced over time.

And because architecture and org structure are the same things, once you reorganize to implement an architecture, you can quit the company. The company will naturally improve the architecture if the business problem remains. And that happened.

Share this:

Like this:

Leave a ReplyCancel reply