wrong tool

You are finite. Zathras is finite. This is wrong tool.


13 architecturalist papers: getting to mainstream

June 9, 2018 by kostadis roussos

Recently one of my co-workers asked how I value innovation. And so I started with a formulation I described before (here: https://wrongtool.kostadis.com/a-relativistic-theory-of-innovation/).

And in our discussion, and because many years have passed since I first wrote about it, I realized I was using a more sophisticated model.

The most significant bit from that earlier post is that there are regions of innovation. Within a region, you can innovate; the next thing to do is reasonably apparent to those inside that region, and may appear magical to those further behind.

When I think about innovation and products, I think about these four regions:

  1. The edge of human knowledge
  2. Innovative commercial products
  3. Mainstream products
  4. Laggard products
I also think of these regions as waves, where innovations in the bleeding-edge wave move downstream.

Most academic research is in the first region and tends to be 5-10 years away, at least, from being part of some innovative product. And most research is intellectually exciting but of marginal commercial value.

Predicting which of the research technologies will win is impossible; the surface area of opportunity is huge, and the investments required to turn those ideas into products are vast.

The biggest commercial wins of all time come when someone can take a research idea and turn it into a commercial product. Examples include server virtualization, network virtualization, web search, and cloud computing.

However, as a strategic software architect, the job isn’t to find that disruptive new technology; the job is to keep a sizeable, successful software business successful and relevant.

Searching for that next disruptive idea is only part of the equation.

The equally important part is managing your investments in delivering commercially viable innovations, ensuring that the mainstream market is satisfied with the product, and making sure you never fall behind the mainstream requirements and become a laggard. In short: avoid region 4 at all costs, always be in region 3, and do some of region 2 while looking for region 1.

A challenge is that although a strategic software architect may be thinking about all four regions at the same time, most teams are in one place and are focused on their backlog for their region.

If your goal as a software architect is to move a team downstream from being pure research, or upstream from being a laggard, you will effectively be replacing the obvious set of next work with something different. Something riskier, because it’s not the obvious next thing.

And if you are a laggard, the product will be suffering from a lot of code rot problems. And just cleaning it up may feel like the top priority. 

And that would be the wrong thing. 

You are a laggard because you are missing features. 

If your code sucks and your build times are horrible, it’s tempting to think that is why you are a laggard. In fact, that is irrelevant. You are a laggard because, in the market, your product has a feature gap relative to the mainstream market products.

The real objective is to own most of the mainstream market and to offer innovative features that capture premium value while hoping that you discover some groundbreaking innovation.

And because you are a laggard, the market doesn’t look for you to innovate, they look for you to deliver the things that they want in the way that they want. The mainstream has defined its requirements, and you must fit into those requirements. A lot of times the temptation is to argue with the mainstream market, but that’s the wrong strategy. The mainstream isn’t interested in innovations, it has a particular point of view, and you need to satisfy it.

There may be innovative ways to deliver those solutions, but they have to meet the requirements of that market even if those requirements are – for lack of a better word – wrong.

What this means is that you have to add – sometimes – even more technical debt to get to the mainstream market so that you can have the revenue dollars and credibility to innovate.

In any new situation, the first job of a strategic software architect is to validate if the product meets the mainstream requirements and if not, all energy has to be focused on that.

Warning signs that you are not meeting mainstream requirements include declining market share, the field losing trust, the emergence of direct competitors offering the same kind of product bought and deployed in the same way, and decaying investment in the core product.

You have to direct your first set of investments and energy at getting out of the laggard space.

At NetApp when I joined the storage management team and was made their architect, this was precisely the situation we were in. The product, DataFabric Manager (aka DFM) had had marginal investment, and suddenly storage management had become a thing that customers cared about. Our field referred to DFM as Doesn’t F*king Matter and our customers were livid.

The first thing we had to do was identify the set of feature gaps that we needed to close to make the product not suck. And to be fair, the product didn’t suck given the investment. What had happened was that the market requirements had shifted on the team and the company had not made the investments to meet those requirements.

Once we identified those feature gaps, we had to code like mad. I remember sitting in a room with the head of engineering, and he said: “Kostadis, I need you to disappear for the next few months. We need to code like mad. And I can’t have you distract anyone with your great next set of ideas.” The engineering head was dyslexic, and he never wrote anything. When he wanted to emphasize things he would take a pen and write them down, and he wrote “Code like mad!!” and underlined it twice. I wish I still had that paper.

And so we coded like mad and got out of the laggard space.

I like to joke this is the easy part. The requirements are understood, the team top-to-bottom understands what it needs to build, the only problem is you have to go very very very fast and have to be very very very disciplined. The hard part is that a laggard product has a lot of internal challenges – a lack of resources, talent and brain drain, and a lot of internal corporate vultures. The internal corporate vultures are the worst. Instead of fixing the laggard product they decide that the right answer is to buy another product, or to do a rewrite or to fire everyone or to do anything but fix the feature gaps.

It’s mind-boggling. In fact, if you walk into a situation where everyone wants to re-write the core product, it probably is indicative of a laggard product.

And the right answer isn’t to do a rewrite but to fix the critical feature gaps.

So when we delivered that next DFM release, we didn’t advance the architecture, we didn’t solve the underlying technical problems, and we may have added some tech debt, but we got to the mainstream, and that set us up for trying to get to the innovative product space.

And we did.


Filed Under: Architecturalist Papers

12 architecturalist papers: people write software

May 30, 2018 by kostadis roussos

When I first became an architect at NetApp, I thought the job was to draw a picture, get the picture approved and then the software would magically be written.

The mental model I had was that there was this massive “power-point to product” compiler and all I had to do was draw the power-point.

To my surprise, it was a little bit more complicated.

People write software, and people are not computers. People have emotions, aspirations, interests, career goals, dislikes, strengths and weaknesses. And those people write software.

How does this influence software system design?

In the first phase, you need to figure out what the right system is. Correctness or appropriateness of a system is independent of human beings.

But then to get it implemented, you need to understand your team.

There will be skills your team has, and there are skills your team needs to acquire and there are skills your team lacks and can’t learn and you need to go find in the marketplace.

And then your job, as a systems architect, is to figure out how to build something with the people you have that adds enough value so you can stay alive.

And sometimes it means you have to wait to hire the people you need.

In many ways, this process feels like being an author of a screenplay who tailors the screenplay to the actors you hired.

One of my projects at Zynga could not start until I hired someone who understood filesystems. And so I lived with data corruption and inconsistency because there was no one who could fix the problem. And when that person was hired, I had to wait for them to ramp up at Zynga. And only when they had finally ramped up could I actually get them to work on the problem.

But finding the right person to solve a problem is the easy part of the job. Motivating them to solve the problem is the hard thing.

The really hard part is to motivate people to write the software. Remember people have lots of reasons why they do things. And people’s best work is done when they are fully engaged in a problem, when they show up wholly – mind and heart and body.

You don’t want extrinsic motivation, because you don’t get people’s best work.

And that means a bunch of things.

The simplest and most obvious is that people have to feel safe to be themselves. If people don’t feel safe, then they will not be there. They have to feel supported. They have to feel free to be their authentic self.

Screaming at people, dismissing people, being cruel, demonstrating how much smarter than them you are, trashing their work: that is how you get something other than their best work. And sadly, in my past lives, I had to have a boss explain this very simple thing to me. And I’ve had to be reminded of it on more occasions than I’d like.

The second is that they have to feel that what they want will happen. And what they want is not what you think it is.

A large part of the job as an architect is to spend time 1×1 with everyone, making sure that they are wholly engaged and understanding what they need. And everyone is different.

For example, a co-worker of mine was trying to re-architect a system, and he was running into flak from his team. And I asked him: Did you talk to everyone to see what they wanted from this effort? And he said, no. And I said: How can you convince people of something if you don’t know what they want?

So he scheduled a bunch of 1×1’s, found out what everyone wanted, and all of a sudden the flak evaporated. It didn’t evaporate merely because he listened to people; it evaporated because he adjusted his plan to meet their wants and needs.

Sometimes I get asked: Why do you spend so much time talking to people? And my answer is: People write software and I want people to be fully invested in a solution because that’s how they do their best work.

Do I always succeed? No. But it’s my North Star.


11 architecturalist papers: The area under the graph

April 19, 2018 by kostadis roussos

A year ago, in a promotion meeting, a senior technical leader warned us about promoting someone too soon. He and I rarely agreed on anything, and I always learned something from our discussions.

His comment was that to be a successful technologist you need area under the curve, not just to cross the threshold. And that time and experience are how you accumulate that area under the graph.

At NetApp, in my 20’s, I was determined to become a technical director as fast as possible. And I crossed the threshold, and they promoted me, and then it took me one more decade to learn the actual job. I was 33 years old, and I was the second youngest technical director.

And I had to have a lot of failures, and experiences before I actually could do the job properly. And I had to learn a lot from people.

For example, one thing I had to learn and then re-learn, is that in a new market all of your instincts are wrong. And I also had to learn that in all markets some things are always true, there are no shortcuts.

What I realized is that the sentences and verbs never change, but the nouns and adjectives do.

A lot of the job of strategic software architecture is not about technology but pattern matching.

And the job is to understand the details so you can map the right sentence. And that was a lesson I had to learn the hard way.

For example, when I went to Zynga, I sat in meetings, and people were using words that I had never heard before used in ways that made no sense. In my first meeting with the executive team, they said: “We need new IP”. And I was flabbergasted – why would a game company need a new internet protocol? Except they meant “Intellectual property” which meant “new game franchise” which meant “new product” and that the real discussion was about how do we launch a new line of business?

The sentence, in this case, was “How do we launch a new line of business,” the noun was “new IP.” And once you realize that a game company spends all of its time launching new games, it makes you rethink what the purpose of software is.

The solution, again from experience, was that before I could do the job they hired me for, I had to learn about the technology. And so I spent several months taking notes on words and asking questions, and within a few months, I had begun to match patterns.

When I went to Juniper, after Zynga, I took the same process. First, learn the nouns, and adjectives, then do the pattern matching and then start figuring out what needs to be done.

To do that job, the pattern matching, you need both the ability and the experience. Getting promoted fast is great, but it takes time to be able to do the job.


10 architecturalist papers: the 8 year itch

April 9, 2018 by kostadis roussos

The most intriguing part of the tech industry is the interplay between Moore’s law and opportunity.

My rule of thumb is the 8-year rule. The rule says the following: market disruptions happen every eight years because the incumbent software stack can’t be adapted to the new hardware.

As some examples…

My current employer took over the world because of the shift to multi-core servers. Apple took over the world courtesy of mobile processors becoming fast enough to run most useful applications. Microsoft and Linux took over the world when the Pentium closed the performance gap between x86 processors and RISC systems.

But why eight? Because we human beings don’t understand exponentially growing curves.

For small enough values of x, y = 2x and y = 2^x are indistinguishable: they agree at x = 1 and x = 2, and only start to diverge at x = 3, when 2^x = 8.
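The divergence being gestured at here is the one between linear and exponential growth; a two-line sketch makes it concrete:

```python
# Compare linear growth (2x) with exponential growth (2**x):
# for the first couple of steps they are indistinguishable.
for x in range(1, 6):
    print(f"x={x}: linear={2 * x}, exponential={2 ** x}")
# At x = 1 and x = 2 the columns match (2 vs 2, 4 vs 4);
# from x = 3 on, the exponential pulls away (6 vs 8, 8 vs 16, 10 vs 32).
```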

So?

Let’s assume a big tech titan has 80% of the market in year 0. Then in year 2 the new hardware emerges, but it’s not appreciably faster except for some small use cases, so instead of selling to 8 of 10 customers, the titan is now selling to 8 of 12. In year 4 it’s 8 of 14, in year 6 it’s 8 of 18, and you go from 80% of the market to less than 50%. And there is probably some other tech titan growing much faster than you were, and you are now the has-been tech titan.
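The erosion arithmetic can be spelled out directly; the year-by-year market sizes below are the ones in the text, everything else is illustrative:

```python
# The incumbent keeps its 8 customers while the total market,
# driven by the new hardware, grows around it.
incumbent = 8
market_size = {0: 10, 2: 12, 4: 14, 6: 18}  # year -> total customers

for year, size in market_size.items():
    print(f"year {year}: {incumbent / size:.0%} share")
# 80% at year 0 erodes to 67%, then 57%, and finally 44% by year 6,
# without the incumbent losing a single existing customer.
```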

Many books and articles cover this topic.

What does this have to do with anything?

When you consider strategic software architecture, the tricky bit is navigating that 8-year transition. And what makes it particularly tricky is that you have to assume that the software you have is going to be your boat-anchor and, simultaneously, your source of funding.

The challenge is that a mature software architecture tuned for one market takes about 8 years to build. It takes about 8 years because any market takes 8 years to mature, and a market takes 8 years to mature because that is how long the hardware needs to become capable enough to grow it. And if your software is tuned for one market, it is not tuned for the next one.

How do you solve this?

The short answer is that you have to find the 8-year curve after the current one.

The strategy is the following:

  1. Grow with the first 8-year cycle.
  2. Preserve market share during the 8-year cycle you missed and figure out what you will do next. Use this time to re-architect your system for the next growth opportunity.
  3. Grow with the next 8-year cycle.

The tricky bit in my mind is to understand when you are in (2) and to realize that your goal is not to continue to improve what you had but to go build something new that builds on top of the assets you have in place.

Where I have failed in the past is in not understanding the importance of (2). It’s tempting to see some market you missed, attack it, and repeatedly fail, instead of admitting you blew it and then trying to find the next thing.

So how do you set yourself up for (3)?

You have to think about where the puck is moving to, and then do everything while you are in phase (2) such that it lines up with where you think phase (3) will be.

And that’s the trick: do the major re-architecting nominally for phase (2), but really for phase (3). And as a strategic software architect, that’s the hardest job, selling the future as an improvement on the present when the entire company is obsessed with a market it already lost, while the market it could win is not yet visible.



09 architecturalist papers: draw me a picture

March 31, 2018 by kostadis roussos

One of the enduring myths about software architecture and in general technology leadership is the degree of control an architect has.

Our advocates believe we walk into a room, draw a picture, everyone listens to us and then code materializes.

And I have worked in places like that. Teams were willingly led off a cliff.

I remember a time at NetApp where a team just wanted me to draw the picture. And I did and then projects got spun up and engineers got assigned. And then I left because that’s not the way I work.

The best teams don’t work that way.

The best teams draw the picture for you and you evaluate if the picture makes sense.

What you want is people to be passionate and believe in their solution. And only very rarely is what they are proposing wrong. Most of the time it’s good enough.

And so my job is to find out how to nudge them away from potential disasters.

Sometimes it can be exasperating because it’s not the picture you want drawn. And sometimes the picture is a compromise between organizations and not the best software. And there is always someone smart enough to point out the better picture that no one had any passion for.

And then they look at me as a failure, because isn’t my job to build the best possible system and force it down my team’s throat?

And the answer is almost never.

The job is to have a product that can evolve to be the best product it can be. And to do that you need a committed team. And a committed, great team will always produce great software even if the picture isn’t exactly what you would have drawn.

Because drawing the picture – ironically – is almost never the job.


08 architecturalist papers: the politics of fear and global warming and product development

July 13, 2017 by kostadis roussos

In 1978, I read a book about the Holocaust. The book was in my elementary school library. A school with a large Jewish population. And I was exposed to horrors that profoundly shake your faith in humanity. Read a book that describes the horrors of Nazi Germany at six, and your world will get warped.

In 1988, a Chemistry teacher at Campion School told my classmates and me that we were dead men walking. The human species was ultimately going to destroy the planet, and our civilization was done. We were 14 years old, and we were dead before our lives had even started.

In 1994, a very senior professor of CS walked into a room of CS majors and told us that our jobs were going to go away. That Indian outsourcing was going to eliminate our jobs. Only 13 people graduated in 1996 with a degree in CS because the rest of my peers took his warning seriously.

From about the mid-1990’s, a profound understanding of global warming made me appreciate that our actions had doomed our current civilization. Either a gentle transition or a massive collapse would happen. And my understanding of the human condition, from my reading about the Holocaust, made me bet on the massive collapse.

Now in 2017, with many of the predictions about the global climate coming true, I look at children and wonder what kind of hellscape they will inherit.

And you think to yourself if you can’t do anything and you are fucked, then you might as well drink the coffee, hug the wife, play with the kid and sing gospels or chant Orthodox prayers.

The fact that this kind of despair has permeated my life makes me wonder why I am still alive, what force has propelled me to keep living?

Only one:  hopelessness is not interesting.

Working in the tech industry, I have learned that you cannot motivate people with hopelessness. If you walk into a meeting and tell a bunch of execs that we are going out of business, they are not interested in what you have to say. If you say that unless we do X or Y or Z we are going out of business, they are still not interested.

Why?

Because every company goes out of business.

You are telling your business leaders that the sun will also rise, that the Universe will end and that they will be forgotten. You are giving them no new information.

And the reasons you can go out of business are so broad and varied and complex that this is just one threat in a spectrum of global threats that affect them.

Strategic Software Architects must be about hope. Our job is not to find the millions of reasons we will die; our job is to find the one way we can win.

And what makes the job so very hard is that we have to create the circumstances that allow us to win.

And here’s why I believe that.

In 2006, I was working at NetApp, and I was asked to produce an analysis of data center technology trends and application trends. And I walked into a meeting with my peers and observed that multi-core systems had created a strategic dead end for NetApp. The value of having external storage was to improve the performance of applications. And in particular, to deal with the fact that storage consumed a lot of CPU and Memory. By having external storage, you could improve the overall performance of your system.

Unfortunately utilization was going down on the servers (2-10%), and as a result, it was increasingly obvious that running the storage on the local system made sense. At the time Oracle and Microsoft were pushing hard for clustered file systems and databases that they felt didn’t need external storage arrays.

And I remember, saying in that meeting: We’re fucked, this was a nice company, time for us to look for a job.

A few months later, Tom Georgens asked Dave Kresse and me to study VMware and see what we could do with them. And I came to the same meeting and said: We’re saved! It turns out that VMware has figured out how to make this utilization problem go away.

And then that begat the question: how the hell do we sell NetApp storage into EMC accounts?

And I remember sitting in a meeting with every business leader at the company explaining a very me too product strategy. And I remember everyone just staring at their laptops. Ten years later, I would have seen a whole bunch of LinkedIn updates.

And somebody asked me: Hey did you see how this guy saved 90% storage using NetApp dedupe?

And it clicked for us all. We had this feature, called dedupe, that allowed us to deduplicate data on primary storage. And VMware had a problem: they needed shared storage to store identical images.

And what was incredible is that dedupe was the feature we kept trying to kill. Originally imagined as an answer to Data Domain, or perhaps a generalization of snapshots, for years teams tried to kill it, and somehow it survived. This piece of unwanted technology transformed our company.

We shut down releases, redid roadmaps to take a piece of technology that barely worked and made it the centerpiece of everything we did.

We convinced the world that deduplication on primary storage was the right thing to do. Dedupe was a technology that no one else had, because it was insane: it intentionally introduced fragmentation. Dedupe on primary workloads was a crazy, stupid proposition for storage.

And we won and lived for another decade.

The morality play, in my head, was the following: we could have curled up and died, I could have taken that interview at Google, or I could have kept looking. And I chose to keep looking.

And because we kept looking, we found something, and we survived and thrived.

If you want to inspire people, don’t tell them they are dead, stare into the abyss and say we will find a portal out of here.

And you know what, you may find a way out of the abyss, and trying to find a way out is always way more interesting.



07 architecturalist papers: how micro-services made picking a programming language different

July 12, 2017 by kostadis roussos

Once you become an operational or strategic architect, programming languages become an option in the toolbox. And then the question becomes which one to pick.

The most important considerations when I started my career were:

  1. Can the language interface with pre-existing code
  2. How mature and stable is the programming language
  3. How many programmers can you hire that know the language
  4. Does the language have a debugger and a profiler
  5. What tradeoffs does the language impose regarding performance, safety, and portability?
  6. What specific libraries and tools and constructs does the language provide for making the project go faster

With large monolithic systems of the 1990’s, #1 forced you to keep the same programming language indefinitely. Unless someone signed up for a rewrite, you had no choice. Even in the case of a rewrite, you always wanted to leverage some of the pre-existing code.

And in the 1990’s the most important piece of code you had to leverage was OS system services.

Microsoft and other programming language vendors attempted to invent programming-language technologies that allowed applications to call into one another, but the tools didn’t quite work, and they locked you into a specific vendor and OS. The first C compiler I bought from Microsoft in 1987 had a long discussion of how you could get BASIC and C to work together.

And you still had the whole problem of cross-OS portability.

Attempts at standardizing libraries through things like POSIX didn’t work at all.

What changed in the 2000’s was the movement to multi-process application architectures using databases as a mechanism to exchange data. The database was cross-platform, vendor-neutral, and language-agnostic. And all of a sudden, choice became an option.

And more importantly, C++ was a real option because it was designed to solve #1 and #5.

In 2004, when I had an opportunity to pick a language as the operational architect for the NetApp Performance Advisor, I chose C++.

I agonized over the decision for a month because, based on my prior experience, this was a once-in-a-decade decision.

And the reasons for C++ were:

  1. C++ could call into all of our C code.
  2. C++ was much more mature than Java at the time
  3. C++ programmers were easy to hire, and C++ programmers could work on the C parts of our system.
  4. Working debuggers and profilers
  5. Allowed us to trade off some performance for safety (string class instead of char*, and reference counted pointers) with no loss of portability across the platforms we care about.

My old-school thinking decided that the Java Native Interface was just too clunky as a mechanism to leverage our existing C code base. And I had spent a lot of time in college writing C++ wrappers and had no time to do that for the huge Data Fabric Manager code base.

Furthermore, I remember thinking that C# would crush Java because it made calling code from C# into C/C++ easier…

Except…

Java, in the end, won, because service architectures were just a better way to write software, and using the database as a way to get different parts of the system to talk to each other made it easy to add Java to a system.

Applications no longer needed to call into libraries; they could use the database to share information. I believed that this was a dead-end architecture because the database increasingly became a bottleneck, which would lead to large monolithic Java systems and a stable, Java-dominated ecosystem, except…

The emergence of SOAP, JSON, and REST made it even easier to combine programming languages and circumvented the DB bottleneck.

In the 1990s, picking a language was a once-in-a-decade decision; now it is a routine one. And that leads to a new problem for strategic software architects: given that operational architects can pick any language at any time, what guidance do you give them?

In short, the basic model still holds, except for one minor tweak:

  1. Can the language interface with pre-existing code?
  2. How mature and stable is the programming language?
  3. How many programmers can you hire who know the language, and how easily can they move between languages?
  4. Does the language have a debugger and a profiler?
  5. What tradeoffs does the language impose regarding performance, safety, and portability?

And that last tweak is hugely important. Each service has a life span, and engineers have to be moved between services as business priorities change. And having different languages creates friction in your ability to move people.

At Zynga, Cadir was adamant that we support only a small set of backend programming languages (C, PHP, and Java) because he placed a very high value on the ability to move engineers easily.

I, personally, loathed PHP, found C too low-level, and could never quite get over my initial interaction with Java in 1994, but he had a good point, and I fully supported his decisions. One time, I forced a team to rewrite their Ruby code in PHP because of our policies.

And Cadir’s decision was a huge win because people could easily move across the company, and sharing code was very easy.

And this leads me to the conclusion that the real list now is:

  1. How mature and stable is the language, including debuggers and profilers?
  2. How easily can programmers move between services?
  3. What specific tools or constructs does the language provide for the domain problem?

You’ll notice that the most important property of 2004 got dropped:

  • Can the language interface with pre-existing code

Because that’s basically a solved problem.


Filed Under: Architecturalist Papers

06 architecturalist papers: musings on strategic, operational and tactical software

July 3, 2017 by kostadis roussos Leave a Comment

  (1)

In a fun thread on Facebook, where a bunch of my fellow architects and I interacted, a question came up:

What is this operational thing you keep talking about?

And well, I thought it might be worthwhile to define.

I borrowed the terms from military games and my long-standing fascination with the First and Second World Wars. Generally speaking, a tactical engagement is one involving a small number of soldiers as part of some battle. An operational engagement is a large battle like the Battle of Kursk or the Invasion of Normandy. A strategic engagement is the Liberation of Europe or, even grander, the Defeat of the Axis. And the lines I just drew are not as simple as that.

For me, the key points are that the bigger the scope:

  1. the more resources are available,
  2. the more impact decisions of the past have on the present,
  3. the bigger the stakes.

Using that mental model:

  • Tactical software architecture is about a bug or a feature spanning a single release.
  • Operational software architecture is about a product spanning multiple releases.
  • Strategic software architecture is about multiple products spanning multiple releases.

Using that mental model, we can answer the question that was posed as a follow-on:

When do I pick a new language or stack?

Definitely out of scope for a tactical software architecture. You’re incrementally improving the product within a bigger operational software architecture, and you should use the tools provided.

That’s an operational software architecture question. The overall strategic software architecture may constrain some choices. However, an operational software architect should have some degrees of freedom.

At the strategic software architecture level, there is never one language or stack. Instead, there are questions of how many, and what are the preferred and what are the rules for adding or removing, and how do they inter-operate.

And on-prem vs. SaaS?

For a strategic software architect, on-prem software limits architectural choices at the operational level. Each new stack has to be as mature as every pre-existing stack, and that means you keep using the same stuff that someone picked years ago. The challenge of adding a new stack is the underlying reason why on-prem software suffers from periodic rewrititis.

For SaaS, on the other hand, adding new technology stacks is easier. Unlike on-prem, where the entire product suffers if a service is bad, in SaaS the impact is more contained. And this gives operational software architects more authority and strategic software architects more freedom. That flexibility is why SaaS platforms don’t go through wholesale rewrites – although individual services do.

(1) The last great operational tank battle, the Battle of Kursk. Map By Bundesarchiv, Bild 101III-Zschaeckel-206-35 / Zschäckel, Friedrich / CC-BY-SA 3.0, CC BY-SA 3.0 de, https://commons.wikimedia.org/w/index.php?curid=5414021


05 architecturalist papers: the rules of rewrites

June 26, 2017 by kostadis roussos Leave a Comment

Over the last 20 years, I have been involved with or led several rewrites of large systems. And that experience has taught me some basic rules for how to do a rewrite successfully. And they are pretty simple.

  1. The business reasons have to make sense to every engineer and every business team, and everyone has to believe them.
  2. The business leaders have to be committed to the rewrite. The bigger the scope of the rewrite, the more senior the committed business leader has to be. If you are rewriting the entire product, then the CEO has to be committed.
  3. Once the plan is agreed upon, the entire team has to be working on the plan, and you have to deliver business value as soon as possible.
  4. Optimize for the right strategic software architecture over the right operational or tactical software architecture.

The first rewrite I was peripherally involved in was a project at SGI called Cellular IRIX. Cellular IRIX was an attempt to build a multi-kernel single memory address space system to support SGI’s new cc-NUMA architecture.

The rewrite failed because there were too many constituencies who opposed it and too many business leaders who didn’t understand its value, leaving too many engineers confused about why they should sign up for it.

When SGI imploded and some key engineering directors got fired, the project died a miserable death.

The second rewrite I was involved in, as an individual contributor, was the rewrite of the NetCache product, and that was a successful rewrite.

The original version of NetCache was a port of software written by Internet Middleware Corporation. The system suffered from some flaws, flaws that were very hard to fix. If my memory serves me well, it was built on a callback-based system that had unclear and uncertain rules about how resources were being managed. There were if-statements littered throughout the code, and function writers were expected to understand every path that could result in a message, in some state, calling their function. In effect, to understand how to write a leaf function you needed to understand all possible states of the global system.

Although the code was a technical mess, the reasons the rewrite succeeded have little to do with that. The technical mess and the near impossibility of fixing it were not why we successfully rewrote NetCache; they’re why we wanted to rewrite NetCache.

We were able to successfully rewrite NetCache because performance, availability, and feature velocity were a serious business problem and there were no alternatives on the table – and yet that wasn’t enough.

The real reasons are that:

  1. Everyone was working on the rewrite.
  2. We knew that our new hardware could not be fully utilized using the old architecture.
  3. There were no constituencies in favor of fixing the old code base. The product managers didn’t want it, the engineers didn’t want it, and the customers didn’t want it.
  4. Everyone from the GM down was committed to the rewrite.
  5. The code base caused business problems everyone understood.
  6. The investment in the rewrite was 2x the total man-years in the original code base.

After the NetCache rewrite, the next failed rewrite was of ONTAP following the Spinnaker acquisition.

NetApp decided to buy a great company with a great engineering organization called Spinnaker. Spinnaker had a clustered file system with embedded namespace virtualization, and it worked. Unfortunately, post-acquisition, we failed to deliver a properly integrated product. Instead, we delivered a new version of ONTAP, ONTAP-GX, that was not compatible with ONTAP. And after we delivered that product, we started another attempt to re-integrate the Spinnaker technology into ONTAP, an effort that became known as ONTAP 8.0.

There were a lot of reasons why this effort failed, and I was on the periphery of it. And it’s tempting to point fingers at people I respect a lot. And so I won’t, because tactical or operational considerations of the failure are of no interest to this blog. What is of interest is this: what was the strategic software context that doomed the effort from the get-go?

At the time of the acquisition, the kind of performance and scale Spinnaker offered was of interest to a tiny part of the overall market. More importantly, the kind of virtualization Spinnaker offered – namely, global namespaces – was of even less interest.

By 2004, the entire storage industry had decided that FC and its evil stepchild iSCSI were the most important protocols on the planet, because structured data was the growth business. And structured data lived in databases.

And NetApp’s business problem was how to address FC and iSCSI, not how to insert namespace virtualization. More importantly, at the time there was little interest in core storage innovation. Dave Hitz, EVP and founder of NetApp, wrote a future-history paper that said as much.

NetApp’s business problem was how do they sell more of what they had to more different customers, not a new storage array.

And every engineering manager, director, and engineer knew this. And so a constituency developed at NetApp that said – sure, I’ll invest in this new system, and at the same time I’ll keep investing in the old system. I know about this alternative plan because, as an architect for the storage management team, I was 100% aligned to wait for the ONTAP rewrite to fail.

The original plan called for the entire engineering team to pivot to the rewrite. And then the pivot didn’t happen. The pivot didn’t happen because, when problems occurred in our core business, investing in a rewrite that wasn’t business-critical became a luxury. And so the rewrite got starved of resources.

At the core, if I can use 20/20 hindsight, the mistake from a strategic software architecture perspective was the decision to have two teams, one working on the original product and one working on the rewrite. The core team felt they were stuck doing sustaining work, and the rewrite team didn’t have enough resources to compete.

And the core manifestation of this separation was that the new ONTAP had a different build system and source code repository than the original ONTAP.

Much like Cellular IRIX, there were too many people looking from the outside in, trying to find ways to make it fail.

The company more or less figured this out after ONTAP GX failed in the market. They realized that the only way forward was one ONTAP, owned by the entire company, made progressively more cluster-aware. And it didn’t hurt that by then people felt that the kind of storage system ONTAP 8.0 was building could be market-leading.

The rest, as they say, is history.

The next rewrite I was involved in was the rewrite of Zynga’s backend, something I led.

The business goal of the rewrite was quite clear. Zynga wanted to be in the 3rd-party market, and to do that we needed to offer services to 3rd-party games. We had services, but the APIs for using them were PHP libraries, and that wouldn’t work. So we had to offer services that 3rd parties could consume via APIs from apps that didn’t have to be in our data center.

And so we decided to take all of our systems and export them over a network API to the world. This became Zynga’s 3rd-party platform, which lived a short life, but the API infrastructure turned out to be very valuable for something else: mobile gaming. Later on, what we built got heavily modified and re-architected, and I like to think that the initial effort, a project called Darwin, led the way.

From ONTAP and NetCache, I took three key lessons:

  1. There had to be a compelling business value that was non-technical.
  2. The business leaders had to be committed to the project, not the technologists. There could be no way for someone to appeal to some exec higher in the chain to reverse course.
  3. Once the plan was decided, we had to go very hard, very fast, with the entire team, and deliver business value as soon as possible.

And that’s more or less what we did. I remember sitting in a room with Cadir, the CTO and owner of all central engineering, where we made the decision to do this. I told him I needed 18 months, and he told me I had six. Then I remember telling him we needed everyone to work on this or not to bother at all, and he said – let’s do this. I continue to admire him for that level of commitment.

And to make sure everyone knew he was personally committed to this effort, Cadir led an all-hands where he personally said that this new effort to rearchitect the backend was something he was committing the entire team to.

We then had a series of architectural meetings where we figured out how to solve the fundamental problems of making our services available to the internet. At the time, we chose systems that were already in existence over writing new ones. Our goal was to have a working system in a quarter and something we could sell in six months.

Those meetings included every architect and constituency, and ultimately decisions were made that pissed people off. Some of those decisions were painful for me later in my career, because I ran over people when I could have just listened and taken them along for the journey.

We did succeed in an environment where people were quitting post-IPO and then quitting because our business was suffering.

And by success, I don’t mean that what we shipped was the right software; I mean that we shipped the right software architecture. And the organization was reorganized around the idea of delivering services through APIs that were centrally managed and accounted for.

And that brings me to rule #4 – Optimize for architecture not for software implementation.

Because Cadir forced me to go fast, we couldn’t figure out the right thing to do in all cases. He forced me to prioritize, and what we needed to accomplish was getting the APIs onto the internet fast, not having the best pieces of software to do it.

The point he taught me and made clear is that we can always improve a shipping product; we can’t improve a project that got canceled. And if the architecture is put together right, then the bits can be enhanced over time.

And because architecture and org structure are the same thing, once you reorganize to implement an architecture, you can quit the company. The company will naturally improve the architecture if the business problem remains. And that is what happened.


04 architecturalist papers: I don’t want to be in the room where it happens

June 21, 2017 by kostadis roussos Leave a Comment

My favorite song in Hamilton is the “Room where it happens.” And my favorite part of the song is this part:

I
Wanna be in
The room where it happens
The room where it happens
I
Wanna be in
The room where it happens
The room where it happens

And the reason is that I had a similar experience in my career, which served to drive all of the rest of my professional success.

It was in 2002, at NetApp. Chris Wagner was the CTO of the NetCache product group and called a meeting of all of the senior engineers who worked on NetCache. And I wasn’t invited. And I remember standing outside of that room, looking in and wanting to be there, inside.

And for the next several years, I struggled to figure out how to get into that room. And I succeeded – and only after I succeeded did I realize I didn’t want to be there. The room where it happens wasn’t the room I wanted to be in.

What I wanted to be was George Washington.

See when Thomas Jefferson, James Madison, and Alexander Hamilton walk into the room, they are debating options that George Washington was okay with. George Washington had created a strategic framework that they had to operate within.

For example, if George Washington wanted Alexander’s plan to come to fruition, he would have pushed for the plan himself instead of sending his annoying right-hand man to negotiate with James Madison and Jefferson.

Similarly, he didn’t care if the capital was in New York or Virginia. If he had cared, then the topic would have been resolved much earlier, with Washington’s intervention.

In short, George Washington gave the folks in the room where it happens a set of choices that they could make, and he was okay with any decision they made.

Strategic software architecture done right ensures that any tactical or operational software decision is immaterial to the long-term strategy, allowing individuals to make choices on their own that still ultimately produce the right final outcomes. In effect, every decision that gets made is one you are okay with, so you don’t have to be in the room where it happens.

