
23 architecturalist papers: latency kills

January 17, 2020 by kostadis roussos

While at NetApp, I saw the incredible effort that became known as ONTAP 8.0 and was part of the Spinnaker acquisition.

From that experience, I learned a few seminal things that continue to resonate. The short version is that latency kills.

Let me start by saying that the hard problem in storage is how to deliver low latency and durability. Enterprise storage vendors earn their 70% gross margins because of the complexity of solving two requirements that appear to conflict. The conflict is that durability requires a copy, and making a copy slows things down.

The solution was, and is, to use algorithms, in-memory data structures, and CPU cycles to deliver both low latency and durability.
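
To make the trade concrete, here is a minimal sketch, not ONTAP's actual design: the write is acknowledged as soon as a durable copy lands on fast media, and the slow copy is deferred off the latency path. The `nvram_log` (fast, persistent staging) and `backing_store` interfaces are hypothetical names for the sake of the example.

```python
import queue
import threading

# A minimal sketch, assuming hypothetical nvram_log and backing_store
# interfaces: durability comes from a copy on fast persistent media,
# while the expensive copy to slow media happens asynchronously.

class WritePath:
    def __init__(self, nvram_log, backing_store):
        self.nvram_log = nvram_log          # fast, durable staging area
        self.backing_store = backing_store  # slow, durable backing store
        self.pending = queue.Queue()
        threading.Thread(target=self._flusher, daemon=True).start()

    def write(self, key, value):
        # The copy that makes the write durable goes to fast media,
        # so the client sees low latency without giving up durability.
        self.nvram_log.append(key, value)
        self.pending.put((key, value))
        return "ack"  # acknowledged as soon as the fast copy is safe

    def _flusher(self):
        # The slow copy to the backing store is off the latency path.
        while True:
            key, value = self.pending.get()
            self.backing_store.put(key, value)
            self.nvram_log.trim(key)
```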

When Spinnaker was acquired, there was a belief within the storage industry that single-socket performance had reached a tipping point, and that performance could only be improved if we threw more sockets at the problem.

And, in retrospect, they were right. Except we collectively missed another trend. Although single-thread performance was no longer going to double at the same rate, storage media was about to go through a discontinuity and radically improve.

But at the time, this wasn’t obvious.

And so many folks concluded that you could only improve performance through scale-out architectures.

The problem with scale-out architectures is that although latency within a single node can be as good as a standalone system's, remote latency is always worse than local latency.

And application developers prefer, for simplicity, to write code that assumes the infrastructure has uniform latency.

And so applications tend to be engineered for the worst-case latency.
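
A back-of-the-envelope example (illustrative numbers, not from the post) shows why this hurts: once some fraction of requests go remote, an application that budgets for the worst case effectively sees the remote latency, even when the average looks fine.

```python
# Illustrative numbers for a hypothetical scale-out cluster.
local_us = 100        # microseconds for a request served by the local node
remote_us = 500       # microseconds when the request hops to another node
remote_fraction = 0.25

average_us = (1 - remote_fraction) * local_us + remote_fraction * remote_us
worst_case_us = remote_us

print(f"average latency:    {average_us:.0f} us")   # 200 us
print(f"worst-case latency: {worst_case_us} us")    # 500 us
# An application that assumes uniform latency has to budget for the worst
# case, so the cluster "feels" like a 500 us system, not a 200 us one.
```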

And so single-node systems were able to compete with clustered systems. As media got faster and single-node performance improved, application performance on non-scale-out architectures was consistently better.

In short, scale-out architectures delivered higher throughput but worse latency.
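
A small sketch with hypothetical numbers makes the split visible: adding nodes multiplies throughput, but per-request latency stays pinned by the remote hop.

```python
# Hypothetical numbers: throughput scales with node count, latency does not.
per_node_iops = 100_000   # IOPS a single node can sustain
local_us = 100            # single-node request latency in microseconds
remote_penalty_us = 400   # extra latency once requests can cross nodes

for nodes in (1, 2, 4, 8):
    throughput = nodes * per_node_iops
    latency_us = local_us + (remote_penalty_us if nodes > 1 else 0)
    print(f"{nodes} node(s): {throughput:,} IOPS, ~{latency_us} us per request")
```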

And it turns out that throughput-bound workloads are not, generally, the ones customers pay a premium for.

And so scale-out for performance has its niche, but it was not able to disrupt non-scale-out architectures.

Over time, clustered storage systems added value in ways other than performance, but the whole experience taught me that customers will always pay for better latency. And that if there is enough money to be made in a problem space, it will be solved in a way that avoids requiring applications to change.


Filed Under: Architecturalist Papers, Storage

Comments

  1. Sudha Sundaram, NetApp says

    February 2, 2020 at 12:37 pm

    Good one, Kostadis! I am with you on this after addressing customer performance latency issues for years. New in-memory PMEM solutions from Intel only reiterate this point.

    • kostadis roussos says

      February 17, 2020 at 2:26 am

      Yes, they do. I find it ironic that a part of my career was spent trying to make clustered systems go faster, and I ignored the possibility that a single-node system could go faster even as processor speeds slowed.

      Latency is hugely valuable.

