kostadis roussos

51 architecturalist papers: transactions and arrangements and architecture reviews

September 22, 2021 by kostadis roussos Leave a Comment

One of the best program managers I have had the chance to work with observed that companies have challenges when transactional-based decision making and arrangement-based decisions come into conflict.

Transactional-based decision-making is what I call strict enumeration and documentation in the form of a contract. Once both sides agree on the document, then even if people change, the decision remains. The trust is in the record, and the change controls surrounding the document. This kind of decision depersonalizes the decision. These decisions tend to be transparent.

Arrangement-based decision-making relies on trust between the two parties. The parties establish trust in various ways, but typically it’s about making sure that the personal goals between both leaders align. Arrangements are remarkably durable and transportable. Meaning that once you establish an arrangement with someone, trust is the basis of subsequent decisions much faster. It also means that decisions without going through a complex process. The problem is that these decisions tend to be opaque.

Both models taken to extremes can be a disaster. I have worked in both. And they both suck.

The law is THE example of a transactional system. Overly transactional environments turn into bureaucratic, legalistic environments.

A pure transactional system can turn into a totalitarian state. A great example is – “the only way to change this spec is to call a meeting of the change control board and submit a request.”

A pure arrangement system can turn into a cult where belief in the great leader is paramount and the phrase “blah said” is used to justify or denounce everything. That disagreement with the great leader is an unforgivable sin. A great example of this is when someone says – “but SR ARCHITECT FOO said”. Or alternatively – an old-boys network – that is impenetrable.

I’ll also observe that both approaches feel natural to different people. My personal experience is being Greek and Canadian and living in the USA. As a Greek, I believe that laws are suggestions. That relationships, and in particular, family relationships, trump everything. As someone who lives in Sunnyvale, CA, I have learned that in the USA, laws matter, but that the laws are structured to satisfy arrangements among the wealthy.

A transactional system feels like a spectacular waste of time because personal relationships trump everything and that, ultimately, transactional systems are a facade for arrangements.

But I have learned that that is naive. The critical flaw of arrangements is the opacity of building trust and the boundaries of trust. Transactional-based decision-making creates a public record of the moment trust was built and describes the trust boundary. Without such a record, trust-building naturally devolves to family, culture, background, and other attributes.

This brings me to how I think about architecture reviews. The purpose of the review is to create a trust boundary between the approver and the author. The more high-level the spec, the more trust that has to exist. That trust is typically built over a series of smaller successes or previous professional successes. The more detailed the spec, the less trust that exists. A key trust moment as an architect is when you are willing to stop being the sole approver of an area.

The key illusion in all of this is that as an architect the spec actually defines what is built. The reality is that unless you are writing the software, some other human being is going to take that document and do what they think the right thing is. My job is to make sure that they are thinking about the problem in a way that aligns with a reasonable solution.

Decrying arrangements because they are old boys networks is wrong. Decrying transactional-based systems because Process is also wrong. Like most everything in life, there is a balance. And like most everything in life, navigating that balance is the art of living and being a software architect.

50 architecturalist papers: portability and the multi-cloud

July 18, 2021 by kostadis roussos Leave a Comment

Over the last 15 years, I’ve been noodling about portability. And the more I noodle about portability, the more I think it’s very poorly defined.

So, let me try.

In the context of computers, there are two flavors of portability, data path portability, and control plane portability. The data path is the bits of code that do real work. The control path is the code that sets up the data path, monitors the data path, and takes corrective actions.

In that weird way that computers work, the control path itself is just another data path. And so – in a bizarre way – there is only data path portability. Simplistically, the control path is instructions that execute on processors, store data in memory and storage.

And although Paris Kanelakis would be pleased to know I learned something in CS051, things are not- quite that simple in the world of commerce that I live in.

Except when they are.

The VM encapsulates a traditional single VM applications control and data path. And the VM, as a result, is a highly portable abstraction because the hypervisor can ignore which instructions are from the control path and which data path; to it, they are all data path instructions. And we know that if you don’t care about performance, any data path can simulate any other data path.

So far, so good.

VMware demonstrated that the portability of a VM is very valuable because you can move it between generations of hardware, move it around to utilize the efficiency of your infrastructure better, and because it’s a convenient way to manage workloads.

And so, the VM abstraction and its portability had another interesting effect. You can hand over the infrastructure operation to another person.

But while I was at NetApp and Zynga, there were all sorts of other portability that people discussed. This data path portability was nice but not sufficient, and it wasn’t enough if you couldn’t encapsulate the entire app into a single VM.

So sure, I could move the VM around, but if I didn’t have the control path, what value was the VM?

And so there was this thought process that said, “VM’s are not that interesting.” are

And I looked into all sorts of interesting programming languages and tool chests.

And yet, along the way, something funny happened. If I cared about performance, I cared a lot about the hardware I was running on. And all of a sudden, the behavior of the hypervisor matter a lot.

At Zynga, when Amazon changed their NIC, our costs skyrocketed. Their decision to optimize their business almost tanked ours.

But performance, who cares about performance? Right? Except, here’s the rub, when you don’t have pricing power for a feature, and you sell the hardware that the feature runs on. The only way it can be worth implementing is to implement the feature without meaningfully increasing the hardware costs.

And then the performance of the feature matters a lot. And some features don’t get implemented because there isn’t an efficient way to do that.

But if you can’t charge more money, why implement the feature? Because – here’s the rub – your competition is adding features very fast. And if you don’t stay ahead, the whole SaaS model suggests that eventually, your customers will move. So as a vendor, you have to keep adding features while keeping the cost of the features in check.

Hoom. So – wait, if I want to grow my business and add more services, and can’t meaningfully increase the price, then, erm, the performance of the feature matters a lot.

And yes, you can make code more efficient and thoughtful, but at some point, you start caring about whether the disk is an SSD or whether the NIC has enough buffers, or what exactly is the CPU you are running on.

I mean, if people didn’t care about these things, those things wouldn’t exist.

Okay, so?

Well, once you start getting to the point that you care about performance that much, then the hypervisor details matter a lot. But surely you jest? Hypervisors make choices, and those choices impact the way your applications run. And if you design an application to run on a particular hypervisor, then the portability of your application is the portability of that hypervisor. If the hypervisor only runs on hardware X or Y, or Z, any performance improvement locks you into that vendor.

To put it differently, if the control path of the hardware is in the hypervisor, and you need to control the hardware, you need to control the hypervisor.

So what? It turns out that the hardware vendors are adding more hardware faster than that integrated solution vendor is sharing. And so, being able to have some degree of portability to run on other hardware will matter a lot.

Because why? Because the more it costs to run a feature, the less money you have and the more money the other person has.

Hoom.

Baremetal! BAREMETAL!

I hear you say. I was cheeky. See, every modern OS is a hypervisor of some kind. When you run on a public cloud, you can’t control the hypervisor.

And in some ways, not to be annoying, running your app in a VM on a cloud is like trying to tune your Java code – optimizations can be done, or you can use C or Rust or C++.

So when I hear bare metal, I don’t hear “no hypervisor”; I hear – – run a hypervisor of your choice on hardware.

BUT BAREMETAL IS HARD. I agree.

So?

Well, here’s where things get very interesting. Suppose you had a portable hypervisor that existed, and you didn’t have to do the hard part of managing the hardware but could control said hypervisor to your heart’s content?

Why then, for those applications where the data path was critical, you wouldn’t be tied to the cloud vendor’s hardware choices.

Data path portability is about being able to run your VM anywhere you want and be able to control the hypervisor.

AHA!

So for the essential thing that can move my costs, I need to move my VM to whatever hardware I want and tweak the hypervisor any way I want.

But then why has the public cloud won so much business?

Well, for the same reason, Java has won over C++.

It’s just got a better control plane for application development. Java makes so many things easier.

And really, that portable hypervisor was painful to use compared to the cloud.

So hypervisor portability was nice, but it was uninteresting if you couldn’t run the control plane. And you couldn’t because it took a lot of time and money to build, and the value of doing that work was marginal. Except when it wasn’t like for Dropbox. Or at Zynga, where we designed our control plane to be portable and had options when Amazon decided to optimize their business at our expense.

But the control planes aren’t tied to the hardware platforms as much.

What does that mean?

Well, it means that for some workloads where the costs matter, using the non-proprietary control planes so I can use the portable hypervisor may be a better choice.

And well, a portable control plane is – to a hypervisor – just another data path.

REPATRIATION!

No.

No software service that exists and increases in value hasn’t been optimized and improved over time. The original S3 relied on MySQL databases, for crying out loud, and Facebook used a single NetApp filer.

Thinking about this as “repatriation” is the wrong mental model. The suitable model is that critical services will get optimized, and at some point, the optimizations will care about the hardware. The existence of portable control planes and portable data planes will be fascinating.

But I have to build my data center!

Nope.

See, those portable hypervisors are pretty much available as a service now.

And that leads me to the final round of thinking.

When application control and data plane portability is possible, will the ability to cram a small number of hardware configurations in data centers be that valuable?

A better physical working environment and the WFH

June 25, 2021 by kostadis roussos Leave a Comment

Watching how many of my coworkers don’t want to go full-time back to work got me thinking about the physical work environment.

Pre-Covid my home office was a laptop with poor ergonomic characteristics on my kitchen table and my car.

My work office, an 8×12 area, shared with a co-worker, had a 32″ monitor, a desk that could move up and down, and a perfect chair and was just a fundamentally more pleasant place to work.

Post-Covid, my home office is a 40×40 room, with a 32″ and 27″ monitor, a desk that can go up and down, a perfect chair, a nice couch, and is quiet.

The effect of Covid was I invested a lot of time and effort to create a great working area at home that no company in Silicon Valley can afford to replicate.

Being forced to come to work in my old office would be going from a really nice office to a worse office – my car and the workspace in my HQ.

If I wasn’t an extrovert, and loud thus irritating my family, I don’t know if I would want to go back to the office.

I am very fortunate in my workspace at work. Some of my coworkers have much smaller and louder workspaces. And those smaller and louder workspaces were created to pack more people into the office because that was just how we did things.

And this got me thinking further.

Pre-COVID, workspaces were a perk that companies could hand out as a way to reward employees. Now employees, having built much better workspaces at home, are rebelling at the idea that they must trade their home workspaces for the qualitatively worse office space.

48 architecturalist papers: system efficiency vs features and the multi-cloud debate

June 20, 2021 by kostadis roussos Leave a Comment

Over the years in my career, I have seen the tension between – “MOAR FEATURES” vs. “MOAR EFFICIENCY’.

The more features camp tends to view everything through the prism of – add more features to find more customers. Therefore, every business problem is a feature problem. And any activity that doesn’t result in more features is a waste of time.

The more efficient camp tends to view everything through the prism of – if the software and the construction of the software were more efficient, we could save money and be more productive.

The pro-features folks tend to argue that growth is the solution to every efficiency problem. If you have more money, you can spend your way out of any problem. And so the goal is always to be making more money. The pro-features camp also argues that there are many ways to be more efficient, including hiring cheaper engineers.

This Manichean view of the universe is how we arrive at Product Manager type CEOs arguing that we shouldn’t invest in anything that detracts from feature velocity.

Strategic software architecture must take a different point of view.

Until a product has achieved product-market fit and viability past 18 months isn’t assured, any investment that pays out after 18 months is of questionable value.

Once a product-market fit has been achieved and viability past 18 months is assured, efficiency investments effectively extend the business’s runway, return the margin to the business, and allow the company to generate more cash per dollar spent.

And the point at which you pivot investment is a delicate one.

The pro-feature camp controls the agenda. And they will push hard against any notion that efficiency investment should trump features.

The pro-efficiency camp being overruled for so long will either not exist or have no ability to drive investment decisions.

Worse, the pro-feature camp, by their nature, tends to view any outcome that is further out than 18 months to be very speculative. And the pro-efficiency camp views the value of their work in a 36 month time horizon.

In this battle, the short-term always wins.

In this struggle and debate, my attitude has always been that efficiency must have its own agenda but must sell it to the pro-feature camp. In effect, every efficiency improvement is really a feature.

But to do that, you have to sacrifice the big bang for small incremental wins along the way. In effect, when the business wants a feature, you write code that also improves the efficiency of the system over time.

And, of course, it’s not that simple.

You also have to pick the right technologies. To deliver the biggest wins, you need to be able to use best-of-breed technologies where there is a competitive market, and you need to be able to swap out technologies easily.

Huh?

Let’s take multi-cloud, for example. One argument is that making your application portable is foolish because it detracts from the core business, which is feature delivery. The argument is typically presented in the form – “private cloud very hard, public cloud easy, single cloud easier.”

And there is some truth to that. And the ease comes at a cost. And debates ensue.

I see it very differently. There is this massive competitive market of infrastructure providers. That infrastructure is a commodity except in the rarest of cases. As a technologist, I enable the business to be more efficient by allowing it to take advantage of that commodity infrastructure through my software choices. Customers of infrastructure will choose because of externalities—things like laws, vendor relationships, and personal taste.

Pre-k8s, this argument was slightly harder to make because there was no obvious API layer that would allow you to decouple your application layer from the common infrastructure, and the market wasn’t very competitive.

In fact, in 2015, at VMware, I argued that such a layer would have to come into existence, or AWS would own the world. So our job at VMware was to stay alive until such a layer emerged. I remember saying to folks – this thing would emerge miraculously, and when it did, we would need to jump all over it. I remember folks looking at me as if I was delusional.

Well, k8s arrived and was exactly that layer, and we jumped all over it. K8s was breathtakingly brilliant because it was the right choice for developers and infrastructure providers.

And the reality is that just picking k8s already makes your SaaS system more portable. And it’s an easy choice and gives you portability from day 0.

But it’s more than just k8s. Picking a managed Postgres over a custom DB makes your application more portable.

Incremental choices that over time add up so that when you want to make bigger bets, the cost of making those bets is small.

And the real bet is that over time because the market is now competitive, the infrastructure vendors will create an ecosystem that makes multi-cloud easier and private cloud easier.

In fact, I believe that a single cloud strategy advocates ignore how k8s is creating an API layer that enables innovation and competition in all layers simultaneously. And it’s not the only layer. Amazon, for example, with i3N metal instances, created a competitive market for hypervisors in AWS.

The anti-multi-cloud folks point to dropbox as a potential failure because of the complexity. On the contrary, I would argue dropbox is an example of how hard multi-cloud is without the right layers creating an opportunity for competition everywhere.

The anti-multi-cloud folks’ argument also turns to – well, java. The argument is that Java attempted to create a sealed abstraction that hid all infrastructure from the applications and failed.

And they have a point. But the goal isn’t the ability to run the application without change on any cloud; the goal is to make the cost of doing that so small that the efficiency gains can be realized. The goal is to make it easy to force the infrastructure vendors to give up some of their margins to the application.

In this picture, you can see what I mean. In the initial state, the application is using only proprietary APIs. Over time, the applications that are being developed have access to standard APIs. The existence of those standard APIs and a competitive market gives choices. The proprietary APIs never go away; they just become a smaller part of the application. And the most important part is that the goal is not to use custom APIs; the goal is not to let custom APIs hold you hostage, to a vendor. And the other point is that it’s multi-cloud; if you want to use custom APIs, use them.

The emergence of that competitive market is creating innovation right now.

And as a result, forward-thinking venture capitalists and forward-thinking architects will choose technologies that will facilitate the emergence of multi-cloud because it is in their selfish interest.

47 architecturalist papers: talent and team over location, always

June 12, 2021 by kostadis roussos Leave a Comment

The recent backtracking of Google, Amazon, and Apple on remote work is fascinating. For companies that are supposed to be on the bleeding edge of corporate governance, the decisions that these folks made not to allow remote work was one that makes me want to ask – “show me your work that leads to your decisions.”

In that spirit, let me show mine.

In 1994, Andries Van Dam got in front of an undergraduate computer science class and announced that the idea of all software engineering remaining in the USA was foolish. As software became more about composing components than building everything from scratch, and because any software engineer anywhere can construct a component that any other engineer could use, jobs would be offshored. In effect, he argued that there would be fewer jobs in the USA.

It’s easy to laugh at Andy now. But in 1994, the bulk of software was for military-industrial uses. The worldwide web had not yet come into existence. And thanks to the Peace-Dividend, jobs were being lost everywhere in tech.

Many years later, he and I caught up, and we remarked that he was right; what he didn’t account for was an ever-increasing pie. In other words, both statements could be true; global teams delivering components would happen, and more jobs writing software would exist.

Post-dot-com boom, the fetish in the tech industry was this concept call off-shoring. The thesis was that companies could find good enough talent at lower cost-locales and thus reduce their costs.

I remember being in a meeting where some GMs declared that we would only be hiring in India, convinced me that Andy’s apocalyptic vision was right, and plan for a backup plan for my software career.

Waiting on a Green Card and having no other realistic choice as the nuclear winter of 2001-2009 took hold in the tech space, I had to learn how to work with remote teams.

In 2002, I made my first recruiting trip to India, where we brought up a team whose job was to take on NetCache. It was really a case of training my replacements. Being young and foolish, I don’t think I really understood what was going on, but it was pretty grim.

In 2003, I had another GM show up in a meeting and explained that we needed to offshore more work to India, and my project on storage management was to be sent to India.

But then something funny happened.

We started hiring in the USA again.

And then we acquired other companies in other parts of the USA.

And then, I suddenly found myself leading teams in Bangalore, Palo Alto, North Carolina, and Pittsburgh.

At one point, I was on a project where everyone was remote, and I wondered why the heck I was coming into work?

From about 2004-2009, every meeting had someone dialing in remotely. Every project had a remote engineering team.

Then I moved to Zynga, and I swore I would never work with remote teams again, and yep, I was working with teams all over the world.

Then I arrived at VMware in 2015, where remote work had not been widely adopted across the company.

But then, from 2015 to 2019, my partner in crime and I decided, that talent not location, mattered. So we hired folks anywhere. Our architect for our biggest technical achievement was on an island in Puget Sound, and the other one is in Humboldt county.

The location didn’t matter. We would hire anyone anywhere.

After the Heptio acquisition, I remember talking to some VP, and he said – “well, it’s going to be hard to meet because the Heptio folks aren’t in PA.” And I was like, “oh, if they were, that would suck because the guy they need to talk to is in Humboldt county and a beach house in Washington State.”

If I looked at my team, I had tech leads in Bangalore, Austin, Washington State, Sofia (Bulgaria), and Humboldt county.

We succeeded as a team because we didn’t care where you lived. And we figured out how to work together even if we didn’t have physical proximity.

And so over 17 years, I have been working with remote teams, and those teams have delivered staggering business value across three different companies and three different industries, including a hardware company.

The idea that remote work is worse than local work is a failure of the imagination of business leaders and management leaders.

But why?

Because here’s what I learned, if X units of work need to get done, and individual Y wants to do them in location Z, then X units of work get done if you let them do it from location Z. If you don’t, the X units of work don’t get done until you find individual i in the desired location R.

So the loss of productivity is measured in the following way:

For everyday d that some engineer E is not working, X units per work per day (x/d) do not get done, or P(E) = x/d

If D(i) is the number of days that it takes to find the imaginary i, then the lost productivity is P(i) = x/d*D(i)

If time to market matters, and days are precious, then the value of the lost days is huge.

So when I hear leaders tell me that agility matters, speed matters, and they are unwilling to let people work from where I must admit, I have a hard time believing their math. And like my wife, who is a math teacher, says, I want to say – “show me your work.”

But then I hear startups or small teams say, but small teams.

And that got me thinking.

In 2003, when my team was only 8 engineers, we hired a team in Bangalore.

In 2010, when Zynga was growing, I couldn’t get recruiting to give me the time of day. I had 5 engineers in San Francisco, and the Chief People Officer, Colleen McCreary, asked if I was willing to hire in Seattle. And so we grew that 5 person team very quickly, and they built the infrastructure that Zynga used over the next decade.

In 2015, when hiring was very hard at VMware, I had nobody in Seattle, and when a very senior engineer showed up, and others were wondering how to make him productive, I said – yes.

Every single time, finding talent was the hard part. Dealing with remote locations was easy. And then, because I was willing to go to remote locations, finding more talent became easier.

Talent matters. Location doesn’t.

And that leads me to the following conclusion – if you can’t work with remote teams, then you might want to look very carefully at how your team works.

46 architecturalist papers: the hybrid cloud

June 8, 2021 by kostadis roussos 8 Comments

In 2013, after being pushed out of my role at Zynga, I was noodling on what to do next. And was thinking about working on hybrid cloud technologies, but I figured that with everything Zynga had achieved, this was an area that was well-trodden and not one where a new startup was welcome.

Wrong.

So let me rewind the clock.

Here’s what I had learned at Zynga.

The public cloud is costly, and a well-run private cloud can save you tons of cash when you are short on cash.
The public cloud is a very efficient way to get something started.

In many ways, it’s the basic tension between using a prototyping environment or language like python versus a more statically typed language like C++.

And that creates a fundamental dilemma, how do you get to the private cloud economics on a public cloud?

Let me start by saying the non-obvious. In 2013, running on the cloud was 1/3 the cost of running AWS with everything priced in.

If Zynga had tried in 2013 when our business was imploding to get to a private cloud, we would have never done it. But in 2009, our prescient CTO Cadir Lee made the strategic bet that we needed to have our own cloud. And by 2012, we had made the transition, allowing us in 2013 to manage our cash very effectively and survive a very long downturn without bleeding cash.

It’s important to realize that in 2011, I was in charge of what we would call DevOps / SRE / AppOps. And I loved AWS. And I was not interested in moving to zCloud. And then we launched CityVille. And the game was crashing, and we had to increase our server count by a factor of three. And when we begged for insight from Amazon, we learned that the problem was that the AWS team had switched NIC’s that had fewer IO buffers, which forced us to get more x86 servers to handle the IO load. That decision ate into our margins decisively, consumed lots of engineering cycles, and convinced me that we had to get off AWS ASAP.

I figured that this knowledge was so obvious that somebody somewhere was working on it as we spoke. And so, I didn’t bother looking for employment in that area.

The other thing I learned at Zynga was about the mobile business and software stack.

I learned at Zynga that vendors like Apple and Google make a lot of money adding software layers on top of Arm instructions to make it hard to write x-platform code.

Games are weird. They are very heavy in terms of graphics and very light in terms of UI. And so, a game can benefit significantly from being written in an x-platform toolkit because most of the time, you are playing on the game board.

And so, I pursued this strategy of trying to enable x-platform development, even to the point where we created our own programming language, playscript.

But it turns out that the x-platform strategy was a losing strategy because if the UI wasn’t native, the experience was marginally worse than some other game. The liquidity of games was so high that it made much sense to use those native features.

And so, the x-platform game strategy wasn’t quite as successful as I would have liked.

But I also learned that whereas with the UX, native integration matters a lot, the value of deep native integration with the back-end systems isn’t.

Customers do not use the back-end systems. Engineers use them. And those engineers – especially the software variety – are surprisingly willing to rebuild or change things.

And a game could be easily transferred from AWS to an on-prem data center if you just organized your system correctly and were prescient.

So how did the system have to be organized?

From my mobile experience, I realized that at the end of the day – every time you used something other than an x86 server, you were paying for someone else’s retirement. And every time you used some service, you were taking a dependency on someone else’s business plan.

Many companies tried to build meta-management layers that spanned the cloud and were a layer above the cloud API. I viewed that as a mistake. The only sane strategy was to have the same x86 server running in the cloud as you had on-prem.

And the only way to achieve that was to use the same kind of hypervisor.

At Zynga, we deliberately chose to run the same kind of Xen Hypervisor that AWS did for that reason. |

When I left Zynga, I was determined to make things like zCloud easy to build. And through a series of convoluted decisions, I ended up at VMware.

Now Kit Colbert talks a lot about multi-cloud in this post, which I urge you to read it.

My personal spin is that ultimately having a common layer that you can reliably deploy anywhere you want is critical. And the companies that last are the ones that can move their workloads and take advantage of the multi-cloud.

In 2013, I had not realized how special the Zynga team was. The guy who built our data centers was Allan Leinwand (currently SVP at Slack). The guy who built our storage system layer was the chief architect at Flipkart, Vinay YS. A key player on the infrastructure side of the house was Nick Tornow (VP of Engineering at Twitter).

The talent we had was staggering.

What VMware is doing is making the ability to run in multiple clouds something that can be done by small teams easily.

It’s a journey and a long one, but one that I am excited about working on.

45 architecturalist papers: a great software architect is a shepherd

May 24, 2021 by kostadis roussos Leave a Comment

Photo by Biegun Wschodni on Unsplash

A friend of mine and a colleague sent me a note today that he feels that his job feels more like a shepherd’s than anything else. He now had to look at what people were doing and keep telling them “not good enough” with the hope that it gets good enough at some point.
This contrasted with the model in his head of the software architect as gatekeeper.

Having worked with him for many years and seeing his growth as a person and a software architect, he articulated what I have been noodling about for many years.

The traditional mental model of a software architect is what several open-source projects implement. Every change is ultimately approved by one or more contributors who judge every change for clarity and quality.

That model makes the architect the gatekeeper of every coding decision.

That model might work well, but it does not work for me.

My model is more of that of a shepherd. I define in my head what is unacceptable. Unacceptable is anything I don’t understand. And that is anything that will break the customer’s deployment of the current product.

A breaking change need not break the current deployment of the existing product. It is always possible to transition to a new version of a system. A truly breaking change is – in my mind – a different product.

What about things I don’t understand? Understanding doesn’t mean agreement. And most importantly, it doesn’t mean code. It means the person sharing the technical data has given me enough details that I am confident I can understand what is being proposed and what it might manifest as, and I am okay with that.

To borrow from Gandalf, until I have understanding and believe this won’t break customers, “You shall not pass.”

Then the next set of questions are about the confidence that what you are building is being built well. My basic strategy is to hire great engineers who build great software. I assume the great engineers are going to produce great code. The real tricky bit is to make sure that the engineers believe that what they are building is the right thing for the right reasons. Like all of us, it’s often easier to say “Yes” to the wrong answer than to communicate “No.” Not every battle is worth fighting over.

My view is that as a software architect, I am required to say, “No.” And that my job is to discover if the engineers think what they are building is correct. And if it is not, be the fulcrum and lever that they need to get the space and time to figure out the right answer.

The other point is that once we know the correct answer, we can then agree to do the wrong thing because of time-to-market reasons. And that is an explicit tradeoff and not a consequence of never even considering the correct answer. And the ability to choose to do the wrong thing allowed flexibility and enabled the right long-term thing to be done. Without that ability, I have seen standoffs between engineering management, product management, and architects. The architects don’t trust the managers and the PM to do the right thing, so they only propose the right long-term solution, pushing out timelines, etc.

Again it’s about shepherding people through a process, not telling them what to do, and not reviewing everything they produce.

Why am I okay with this?

Look, I have 10 fingers, and I do more check-ins than any person in the company. Because I have a team of hundreds that does hundreds of check-ins per day, and they can operate independently, looking at the correct data. By leveraging their collective intelligence, we can go faster and innovate faster than any other leadership model.

Is there a risk that things will go badly? Yes. But I prefer to deal with the failures and then understand how to mitigate them going forward. To innovate is to fail. To do new things is to fail. To encourage people is to fail. I would instead fall that way than any other way. Because when you innovate, do new things, and encourage individuals and teams, greatness happens.

44 architecturalist papers: the value of a college degree

April 21, 2021 by kostadis roussos Leave a Comment

Over the last twenty years, I have been the least impressed with the value of formal education in the field of computer science for most practitioners who do most of the work.

As someone who struggled to do well in exams and avoided classes with exams, I never understood what they measured. I know that I never applied to Google because they had this ridiculous requirement to see my GPA. I graduated with honors and summa cum-laude. It wasn’t my grades; it was the principal.

Maybe it was being bullied. As a weird, obnoxious Greek, my experience at Brown was toxic. I didn’t fit in. And all I wanted was to put as much physical distance with that part of my life as I could. And maybe the experience taught me that maybe this degree wasn’t as valuable as I thought. And maybe I thought, if everyone at top-colleges were like some of the specific people I had to deal with, then I would prefer to never work with them again. And the easiest thing to do was never to go work where they worked.

Over the years, I had no issues with hiring or promoting people on my team with no CS degree. At Zynga, we did real paper facebook of every senior leader with no CS degree, which was illuminating. I do remember that the guy who was the architect for cafe-world had no degree in CS.

Schools teach you the wrong skills, abstract concepts that are of meandering value, and most importantly, the wrong interpersonal values. Production software is a team sport. School is an individual sport. Production software is about maintaining software over time, not hitting a deadline. Production software is about customers and business requirements, not some contrived technical problem that some poor Professor invented to grade your attention span in a class. Production software is about the wisdom of how broken something can be.

In fact, I will observe my high-school history class, and the value it placed in critically understanding the nature of information and sources has proven to be far more valuable than any CS class I ever took. I would trade any CS class for that one.

But I *never* dismissed a top college’s most important value to a student. In fact, while simultaneously shitting on the value of a CS degree, I would tell folks, “a CS degree at a top-school has extraordinary value for your first set of jobs that set you up for your next set.”

If you want to get hired, out-of-school, graduating from a top school is orders of magnitude more valuable than anything else. And then, because the next set of jobs is a function of your relationships that you build while you work, the next set of jobs. I met an Israeli sales guy, and he told me the same thing. He was infuriated that he never got a job at a top-tech company because he felt he didn’t have the right degrees from the right school

I graduated from Brown University with a degree in CS, where I barely understood algorithmic analysis (I still don’t get how to do induction). I didn’t know what a database was. I didn’t know what a compiler was.

But I got a job at SGI in the kernel group because Forest Baskett’s daughter was a student at Brown. Forest Baskett was the then CTO of SGI, and he interviewed students at Brown as part of a recruiting project.

That had more to do with anything.

At SGI and later NetApp, I got a part-time Master’s degree at great personal expense and filled in gaps in my understanding of the field. But I did that to get a military deferral. I was wealthy enough to do that.

Having said all of that, I think learning and a life-long love of learning are crucial to your personal happiness and success. And a college education, if you can afford it, can help in that way, maybe. And I do believe learning to analyze how people talk and think and learn critically is valuable. And if you need to learn some boutique knowledge and a college setting works for you, by all means, take a class.

But the real reason a top-college degree is valuable? Because recruiters go there. That’s all.

43 architecturalist papers: p-zero or die.

April 15, 2021 by kostadis roussos Leave a Comment

[so I tweaked an earlier post to be more inclusive and more relevant]

VMware has – possibly – the coolest skunkworks system at any company. Skunkworks projects are shared at a three-day internal conference known as RADIO.

During the year, employees across the company work on projects and produce papers based on those projects whose only purpose is to share them at RADIO and possibly get funded later.

Andrew Lambeth is a Fellow and all-around amazing person who gave an excellent talk titled P-zero or die. The point of the talk was how to take any big idea that you had and get it funded.

What does P-zero mean? Well, the P means priority. Everything in a company has a priority to decide what the team works on first, and the number zero is the highest priority level. Why zero? Because, well, it turns out, first you start with the most important things having priority 1. And then the teams realize that there are too many P1’s. So instead of moving everything down to P2, a new priority level is created, priority zero. And then, of course, there are too many priority zero’s, and someone says, “let’s create p -1”. Then everyone realizes how many scripts and tools will break and that this makes them look ridiculous. So the org finally does the hard work to figure out how to reprioritize. But why didn’t this happen when p-zero was created? Well, because computer scientists start counting at 0, not 1. So 0 is the lowest number, not 1.

The fundamental principle of the talk is that if you don’t make your big idea a p0, it will get deprioritized for other stuff. And the reason it got deprioritized is that it’s big, and its value proposition was unclear, and people didn’t understand what you were trying to accomplish.

And so here’s the checklist of things you need to do to make your big idea someone else’s p0.

1. Describe it effectively in 5 minutes.
2. Success is easy to measure.
3. Listeners must understand, not agree.
4. Have a document to share, not go over.
5. Describe it on any media.
6. Pitch it at every opportunity, relentlessly.

And the most crucial thing is this:

7. If you get no traction, then move on to the next big idea.

Sometimes a big idea’s time has not come, and you just need to let it go.

I liked the talk so much that I decided to make a t-shirt.

P-Zero or Die

42 architecturalist papers: filling a glass with water is easy, building a glass is hard.

April 10, 2021 by kostadis roussos Leave a Comment

So my dad is one of those legendary figures in modern science. His H-index is 70+. His research transformed what we think of as hospital medicine, from a series of narrowly scoped disciplines to a systems theory of the human body where the center was critical care of key systems. And along the way, he transformed medicine in at least two countries, Canada and Greece, with his tireless advocacy and work to improve the medical systems either directly or through the people and institutions he built.

Whenever I need humility, I talk to him.

I mean, I have done a lot of big things in my life, but no, I didn’t change the fundamental way we think about how we die (body first, brain last).

In oh-so-many meetings, I stand in front of a large number of engineers and product managers and engineering managers and say, “see this impossible hill; we will climb it.”

And there is a natural inclination, “how?” And “how fast?” And “we have so much insanely hard work to do!”

For most of my professional life, I view that as an affront. That the person asking the question was dismissing the solution because it was too hard. As if there was some easier path that I had deliberately chosen to ignore. I thought they were saying you came up with the wrong answer. And I would get angry, and pissed off, and frustrated.

Recently I took this EQi test. And a key element of that test was how much you used reality to make judgments. And, well, I scored poorly.

And that got me thinking, why? Because I tend to look at reality not as immutable but as mutable.

As someone who has seen how the future changed because of what individuals like my dad did, and more modestly I did, my opinion of the relentless forces of nature and history is that I will always be willing to use my living hand to challenge any invisible or dead hand.

So once again, I was in a meeting where someone said, “this is the moral equivalent of scaling the North Face in a snowstorm.”

And I was like – “Man, you and I are so different, in a world that is 70% water, I look at an empty glass and think how easy it is to fill it.”

And, of course, that went down poorly. Because that person felt I was dismissing their observation. And, to be honest, I was.

And so I told this to my dad. My dad goes, “blah. Filling a glass with water is easy. Go to a lake or the sea. Finding a glass is easy. Go to any store. But make the right kind of glass that fits in the right place. That’s what’s hard.”

And he then looked at me with that look, “Good you can do hard things, I am proud of you.” My dad was never one for recognition, etc. H was always about the next hill to climb. And then we spent the rest of the time talking about how filling that glass is so damned hard.

And it got me thinking a lot about how strategic software architecture is about finding that right glass. And that is hard. And when I find that glass, I am excited, and all I can see is how we could achieve miracles if we just filled it.

But the work has barely begun for everyone else.

And filling that glass with water is hard.

And being dismissive of the challenge and not recognizing that the work has barely begun is critical. And when people tell you, “dude, I am not climbing the North Face in a snowstorm unless you break this down a little bit more,” I shouldn’t get angry; I should be delighted. Any sane person would be like – “good luck.”

Next time, I’ll not stand waiting for the applause and get annoyed that everyone isn’t admiring my achievement of identifying the right glass.

So let me adjust my thinking.

When I find a glass, I know we can fill it with water. The effort to fill it with water in a timely fashion will make the finding of the glass feel like a trivial subtask.

And to ask people to be happy for me for finding the glass is kind of like the tenor asking for applause for clearing his voice. I’m asking them to trust me that I know where the safe path up the North Face is. Maybe, instead of asking them to applaud, I should start preparing for the climb and be grateful that they might follow me.

51 architecturalist papers: transactions and arrangements and architecture reviews

Like this:

50 architecturalist papers: portability and the multi-cloud

Like this:

A better physical working environment and the WFH

Like this:

48 architecturalist papers: system efficiency vs features and the multi-cloud debate

Like this:

47 architecturalist papers: talent and team over location, always

Like this:

46 architecturalist papers: the hybrid cloud

Like this:

45 architecturalist papers: a great software architect is a shepherd

Like this:

44 architecturalist papers: the value of a college degree

Like this:

43 architecturalist papers: p-zero or die.

Like this:

42 architecturalist papers: filling a glass with water is easy, building a glass is hard.

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this: