
23 architecturalist papers: latency kills

January 17, 2020 by kostadis roussos

While at NetApp, I saw the incredible effort that became known as ONTAP 8.0 and was part of the Spinnaker acquisition.

From that experience, I learned a few seminal things that continue to resonate. The short version is that latency kills.

Let me start by saying that the hard problem in storage is how to deliver low latency and durability. Enterprise storage vendors earn their 70% gross margins because of the complexity of solving two requirements that appear to conflict. The conflict is that durability requires a copy, and making a copy slows things down.

The solution was, and is, to use algorithms, in-memory data structures, and CPU cycles to deliver both low latency and durability.
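
To make the trade concrete, here is a minimal sketch, not ONTAP's actual design: the write is acknowledged as soon as a durable copy lands on fast media, and the slow copy is deferred off the latency path. The `nvram_log` (fast, persistent staging) and `backing_store` interfaces are hypothetical names for the sake of the example.

```python
import queue
import threading

# A minimal sketch, assuming hypothetical nvram_log and backing_store
# interfaces: durability comes from a copy on fast persistent media,
# while the expensive copy to slow media happens asynchronously.

class WritePath:
    def __init__(self, nvram_log, backing_store):
        self.nvram_log = nvram_log          # fast, durable staging area
        self.backing_store = backing_store  # slow, durable backing store
        self.pending = queue.Queue()
        threading.Thread(target=self._flusher, daemon=True).start()

    def write(self, key, value):
        # The copy that makes the write durable goes to fast media,
        # so the client sees low latency without giving up durability.
        self.nvram_log.append(key, value)
        self.pending.put((key, value))
        return "ack"  # acknowledged as soon as the fast copy is safe

    def _flusher(self):
        # The slow copy to the backing store is off the latency path.
        while True:
            key, value = self.pending.get()
            self.backing_store.put(key, value)
            self.nvram_log.trim(key)
```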

When Spinnaker was acquired, there was a belief within the storage industry that single-socket performance had reached a tipping point, and that performance could only be improved if we threw more sockets at the problem.

And, in retrospect, they were right. Except we collectively missed another trend. Although single-thread performance was no longer going to double at the same rate, storage media was about to go through a discontinuity and radically improve.

But at the time, this wasn’t obvious.

And so many folks concluded that you could only improve performance through scale-out architectures.

The problem with scale-out architectures is that although latency within a single node can be as good as a standalone system's, remote latency is always worse than local latency.

And application developers prefer, for simplicity, to write code that assumes the infrastructure has uniform latency.

And so applications tend to be engineered for the worst-case latency.
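
A back-of-the-envelope example (illustrative numbers, not from the post) shows why this hurts: once some fraction of requests go remote, an application that budgets for the worst case effectively sees the remote latency, even when the average looks fine.

```python
# Illustrative numbers for a hypothetical scale-out cluster.
local_us = 100        # microseconds for a request served by the local node
remote_us = 500       # microseconds when the request hops to another node
remote_fraction = 0.25

average_us = (1 - remote_fraction) * local_us + remote_fraction * remote_us
worst_case_us = remote_us

print(f"average latency:    {average_us:.0f} us")   # 200 us
print(f"worst-case latency: {worst_case_us} us")    # 500 us
# An application that assumes uniform latency has to budget for the worst
# case, so the cluster "feels" like a 500 us system, not a 200 us one.
```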

And so single-node systems were able to compete with clustered systems. As media got faster and single-node performance improved, application performance on non-scale-out architectures was consistently better.

In short, scale-out architectures delivered higher throughput but worse latency.
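
A small sketch with hypothetical numbers makes the split visible: adding nodes multiplies throughput, but per-request latency stays pinned by the remote hop.

```python
# Hypothetical numbers: throughput scales with node count, latency does not.
per_node_iops = 100_000   # IOPS a single node can sustain
local_us = 100            # single-node request latency in microseconds
remote_penalty_us = 400   # extra latency once requests can cross nodes

for nodes in (1, 2, 4, 8):
    throughput = nodes * per_node_iops
    latency_us = local_us + (remote_penalty_us if nodes > 1 else 0)
    print(f"{nodes} node(s): {throughput:,} IOPS, ~{latency_us} us per request")
```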

And it turns out that throughput-bound workloads are not, generally, the ones customers pay a premium for.

And so scale-out for performance has its niche, but it was not able to disrupt non-scale-out architectures.

Over time, clustered storage systems added value in ways other than performance, but the whole experience taught me that customers will always pay for better latency. And that if there is enough money to be made in a problem space, it will be solved in a way that avoids requiring applications to change.


Filed Under: Architecturalist Papers, Storage

Comments

  1. Sudha Sundaram, NetApp says

    February 2, 2020 at 12:37 pm

    Good one, Kostadis! I am with you on this after addressing customer performance latency issues for years. New in-memory PMEM solutions from Intel only reiterate this point.

    • kostadis roussos says

      February 17, 2020 at 2:26 am

      Yes, they do. I find it ironic that a part of my career was spent trying to make clustered systems go faster, and I ignored the possibility that a single-node system could go faster even as processor speeds slowed.

      Latency is hugely valuable.

