Permabits and Petabytes Blog: OEM Data Optimization for Next Generation Storage

Archive for January, 2009

Great time to be an IT pro and storage buyer

Hello. I'm Tom Cook, CEO & President of Permabit Technology Corporation. Jered Floyd has been writing many blog posts about technology trends in the storage industry. I’m jumping in now to provide a somewhat different perspective. I’ll write about topics and issues I hear about when talking with enterprise IT and storage leaders. Despite the increasing pressures of managing the explosive growth of digital inform...

Data Protection’s Black Swan: Seagate Drive Failures

At this point there's been lots of press coverage of the very high failure rates of Seagate's Barracuda 7200.11 desktop drives. Last Friday, Seagate came clean and admitted this is due to a firmware bug and that the bug affects several other drive families as well. The good news is that the problem doesn't affect the integrity of the data stored on the drive, but the bad news is that if the bug has already hit, you'll have to send your drive to ...

Thoughts for a National CTO

As has been covered extensively over the past two months, President-Elect Obama has announced plans to appoint a National CTO. As a CTO, I was asked recently what my comments and suggestions would be to Obama and this new, federal CTO. I thought about this for a while, and I think there are a number of things that are critical to the success of a National CTO initiative.  First and foremost, I think it's absolutely necessary that the National ...

The Green Is A Lie

The Consumer Electronics Show is on this week and "green" is big news yet again, with 22 percent of consumers willing to pay more for the label, but even more being skeptical of what that label really means. They're right to be concerned. George Crump says today that "green is a 'nice to have'," because the number one concern in the current economy is ROI, and cost justification based on power savings is a very long term prospect. I agree -- c...

The Memory Bandwidth Gap

Happy new year, everyone! It's now 2009, which means I'll be writing the wrong date on my checks for another few months at least. We're celebrating 2009 with a new addition to our family:


Gir, the storage bullmastiff

Over at StorageMojo, Robin comments on the challenges of shared memory controllers with multi-core processors. This is actually something that's been a big problem for regular software development for a while now, and is especially important in the storage space.

The big problem today is that processors keep getting faster, but memory latency and bandwidth aren't keeping up. A processor can perform a complicated operation in a nanosecond, but retrieving the data to operate on might take ten times that long or more. This is particularly an issue with storage because nearly everything involves moving data to and from the system, and all that data has to pass through main memory. If it has to pass through the processor as well, you can rapidly use up the available memory bandwidth; this is one of the reasons why technologies like RDMA have been developed.
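To put rough numbers on how quickly memory bandwidth gets used up, here is a minimal back-of-the-envelope sketch. The 10 GB/s bandwidth figure, the function name `max_payload_rate`, and the two data paths are all hypothetical, chosen only for illustration; the point is simply that every extra trip a byte makes across the memory bus divides the throughput you can sustain.

```python
def max_payload_rate(mem_bandwidth_gbs, memory_touches_per_byte):
    """Upper bound on payload throughput (GB/s) when every payload byte
    crosses the memory bus `memory_touches_per_byte` times."""
    if memory_touches_per_byte <= 0:
        raise ValueError("each byte must cross the memory bus at least once")
    return mem_bandwidth_gbs / memory_touches_per_byte


# Hypothetical figure: 10 GB/s of usable memory bandwidth.
MEM_BW = 10.0

# DMA-only path: the NIC writes into memory (1 touch) and the
# disk controller reads it back out (1 touch).
dma_only = max_payload_rate(MEM_BW, 2)

# CPU-touched path: NIC write (1), CPU read for a checksum (1),
# CPU copy into another buffer (read + write = 2), disk controller read (1).
cpu_touched = max_payload_rate(MEM_BW, 5)

print(f"DMA only:    {dma_only:.1f} GB/s")
print(f"CPU touched: {cpu_touched:.1f} GB/s")
```

In this toy model, one checksum pass and one in-memory copy cut the ceiling from 5 GB/s to 2 GB/s before the processor has done any real work on the data.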

Even worse, there don't seem to be any good tools for identifying whether a software process is performance-bound by memory bandwidth or latency! You might naively increase the processor speed or the number of cores in your system to increase performance and find no change at all. Let me explain.

Let's say you're working on improving performance for an application. It's easy to observe that the network interface is saturated, and thus your process is network I/O-bound -- to improve performance you need to add more I/O bandwidth or change your wire encoding. With a little more instrumentation you can determine if your process is network latency-bound -- waiting on remote requests all the time -- and know that you need to add more network parallelism.

Similarly, it's easy to tell if you are disk bandwidth or latency-bound -- you'll always be in disk wait.

If you're not waiting on the disk and you're not waiting on the network, the default assumption is that you are CPU-bound -- add a faster processor and you're on your way, or optimize the areas that your profiler shows you spending time in to run in fewer cycles. But this frequently doesn't help today.
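The triage described so far can be written down as a tiny decision procedure. This is only a sketch: the `Sample` fields, the `triage` function, and the 0.9 and 0.5 thresholds are hypothetical stand-ins for whatever your own instrumentation reports.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    net_utilization: float  # fraction of link bandwidth in use, 0..1
    net_wait: float         # fraction of wall time blocked on remote replies
    disk_wait: float        # fraction of wall time spent in disk wait


def triage(s, busy=0.9, waiting=0.5):
    """Walk the checks in the order the text gives them.
    The thresholds are illustrative assumptions, not measured values."""
    if s.net_utilization >= busy:
        return "network bandwidth-bound"
    if s.net_wait >= waiting:
        return "network latency-bound"
    if s.disk_wait >= waiting:
        return "disk-bound"
    # Nothing above fired, so the usual conclusion is "CPU-bound".
    # That is only a default assumption: a memory-bound process lands here too.
    return "assumed CPU-bound (could be memory-bound)"


print(triage(Sample(net_utilization=0.95, net_wait=0.1, disk_wait=0.0)))
print(triage(Sample(net_utilization=0.20, net_wait=0.1, disk_wait=0.1)))
```

The last branch is the weak spot: it is reached by elimination, not by measurement.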

Processor speed has greatly outstripped memory speed. If you're operating on data in registers or in cache, adding a faster processor can help. Most data is in main memory, however, and you need to get it onto the chip -- fulfilling a request from main memory can take dozens to hundreds of processor clock cycles! An instruction that needs that data cannot complete until it has been retrieved, so even if the processor were twice as fast, it couldn't get more done.
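A small worked example makes the arithmetic concrete. It is a deliberately pessimistic sketch that assumes no overlap between computation and memory stalls; the workload size, the miss count, and the 60 ns miss latency are hypothetical numbers, and `run_time_s` is a name made up for this illustration.

```python
def run_time_s(compute_cycles, clock_hz, misses, miss_latency_s):
    """Toy model: compute time shrinks as the clock rises, but the time
    spent waiting on main memory is fixed in seconds.
    Assumes stalls and computation never overlap."""
    return compute_cycles / clock_hz + misses * miss_latency_s


# Hypothetical workload: 1e9 cycles of real work, 1e7 main-memory misses.
COMPUTE_CYCLES = 1e9
MISSES = 1e7
MISS_LATENCY_S = 60e-9  # assumed main-memory latency

slow = run_time_s(COMPUTE_CYCLES, 2e9, MISSES, MISS_LATENCY_S)
fast = run_time_s(COMPUTE_CYCLES, 4e9, MISSES, MISS_LATENCY_S)

print(f"2 GHz: {slow:.2f} s, one miss costs {MISS_LATENCY_S * 2e9:.0f} cycles")
print(f"4 GHz: {fast:.2f} s, one miss costs {MISS_LATENCY_S * 4e9:.0f} cycles")
print(f"speedup from doubling the clock: {slow / fast:.2f}x")
```

Under these assumptions, doubling the clock buys about a 1.3x speedup rather than 2x, and each miss now costs twice as many cycles as before.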

This is why modern processors have whizz-bang features like out-of-order execution, branch prediction, processor virtualization and parallelism: the goal is to have enough work in flight that the processor can always be doing something useful while it waits for memory requests to be fulfilled. This can mask a lot of the memory delay, but at some point you run out of other things for the processor to do.

Other than experimentally, how can you tell if a process is memory-bound? As far as I know, all profiling mechanisms will show such a process as CPU-bound, because the sampler will find the instruction pointers sitting in routines doing lots of computation on things in memory, and this will be indistinguishable from any number crunching those routines do. I'm pretty sure I can ask the processor's hardware performance counters for cache miss statistics, but a miss count alone doesn't tell me how much time was actually spent stalled.

Knowing whether you're memory-bound is really important, because it tells you whether it's productive to jump through hoops trying to eliminate in-memory copies (for example) or whether you just need a faster processor. It can tell you that adding additional memory channels or faster memory is a win. But I can't find any way to determine this from a profiling perspective.
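For what it's worth, the experimental route can be made a little more systematic. The sketch below assumes you can run the same workload at two different processor clock speeds while leaving the memory system unchanged, and that run time follows the simple model t(f) = a/f + b. The function name and the measurements are hypothetical. The fixed term b lumps together everything that does not scale with the clock, so it is only a candidate for memory stall time once disk and network waits have been ruled out.

```python
def split_by_clock_scaling(t1, f1, t2, f2):
    """Fit t(f) = a / f + b to two (run time, clock frequency) measurements.

    Returns (scaling_seconds_at_f1, fixed_seconds): the part of the first
    run that shrinks with a faster clock, and the part that does not.
    """
    if f1 == f2:
        raise ValueError("need two different clock frequencies")
    a = (t1 - t2) / (1.0 / f1 - 1.0 / f2)  # cycles of clock-scaled work
    b = t1 - a / f1                        # seconds that ignore the clock
    return a / f1, b


# Hypothetical measurements: 10.0 s at 2 GHz, 8.0 s at 3 GHz.
scaling, fixed = split_by_clock_scaling(10.0, 2.0e9, 8.0, 3.0e9)
print(f"clock-scaled work: {scaling:.1f} s")
print(f"fixed (possibly memory stall): {fixed:.1f} s")
```

With these made-up numbers, about 4 of the 10 seconds would not respond to a faster processor at all. Two points only pin down a straight-line model, so more frequency steps would be needed before trusting the fit.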

Here's a good paper explaining the problem further, with the money quote: "In the long term, we predict that off-chip accesses will be so expensive that all system memory will reside on one or more processor chips."

The authors divide compute time into processing time, memory latency stall time, and memory bandwidth stall time. This is exactly the data I'd like to see -- but they've gathered it by running SPECmarks on a simulated processor and memory architecture. I'd love to gather profiling data in situ, or at least on a synthetic modern Intel architecture... Cachegrind gets you part of the way there, but not far enough.
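To show what such a three-way split looks like in practice, here is a sketch of one way it could be derived from simulation. It assumes three runs of the same workload: one with a perfect memory system, one with real latency but unlimited bandwidth, and one with the full memory system. Whether the authors use exactly this procedure is not something I'm asserting here; the function name, parameter names, and timings are hypothetical.

```python
def decompose(t_perfect, t_infinite_bw, t_real):
    """Split total run time into processing, latency-stall and
    bandwidth-stall fractions from three simulated run times (seconds).

    t_perfect     -- every memory access completes immediately
    t_infinite_bw -- real latency, but no bandwidth limit
    t_real        -- the full memory system
    """
    if not (0 < t_perfect <= t_infinite_bw <= t_real):
        raise ValueError("expected 0 < t_perfect <= t_infinite_bw <= t_real")
    parts = {
        "processing": t_perfect,
        "latency_stall": t_infinite_bw - t_perfect,
        "bandwidth_stall": t_real - t_infinite_bw,
    }
    # Express each component as a fraction of the real run time.
    return {name: seconds / t_real for name, seconds in parts.items()}


# Hypothetical simulated timings, in seconds.
for name, fraction in decompose(4.0, 7.0, 10.0).items():
    print(f"{name:16s} {fraction:.0%}")
```

A breakdown like this says directly whether to chase faster memory, more memory channels, or a faster processor, which is why it is frustrating that it seems to be available only from a simulator.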

Meanwhile, we work hard to increase system performance with the tools and instrumentation that we have available. We've produced a 50% performance improvement over the past six months, and are on track to repeat that!