Dirty Little Secrets About Dirty Little Secrets
Over at eWeek, Chris Preimesberger has what looks like a brutal article on archive systems. On further inspection, though, it looks more like a damning (but nameless) indictment of EMC’s Centera. In that context, he’s spot on, but I wish he hadn’t tried to drag down all of archiving because of one rotten apple!
Let’s look through the problems Chris identifies.
1: Scalability. CAS (content-addressable storage) archives have a hard limit on the number of objects that can be stored.
Let’s clear one thing up right here: archives are not CAS. CAS (content-addressed storage) is a technology that some archive systems use for deduplication and data authenticity, but this has nothing to do with the system being used as an archive. CAS is not a market, is not a product, and is not an interface; it’s just a class of technologies. These technologies, in general, make no difference to you from a user perspective. It’s time we moved beyond talking about “CAS”, because it’s just a red herring.
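To make "content-addressed" concrete, here is a minimal Python sketch of the general idea, assuming SHA-256 as the hash. The class and method names are hypothetical and say nothing about how Centera, Permabit, or any other product is built. It shows where the two benefits come from: identical content lands at the same address (deduplication), and a read can be checked against its own address (authenticity).

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory store keyed by the SHA-256 digest of each object's bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content, not chosen by the caller.
        address = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same address, so it is stored once.
        self._objects.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        data = self._objects[address]
        # Re-hash on read: a mismatch means the stored bytes were altered.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored object does not match its address")
        return data

    def object_count(self) -> int:
        return len(self._objects)


store = ContentAddressedStore()
first = store.put(b"quarterly report")
second = store.put(b"quarterly report")  # duplicate write
```

Both writes return the same address and the store holds a single object. Note that nothing here is archive-specific, which is rather the point.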
All storage systems have to balance resources, and some choose poorly. Have you ever run out of inodes in your file system because you had too many small files? That’s the same problem that he talks about here. You won’t run into that problem with a Permabit Enterprise Archive… we’re optimized to deliver top performance with an average file size as small as 8 kilobytes. Even with potentially small files like emails you’ll still be able to use all of your archive storage.
2: Performance degradation. As objects pile up in an archive, the speed at which the archive runs slows down tremendously.
Again, all storage systems have the potential to slow down as they fill, not just archive systems. But the (Nexsan-specific) answer seems particularly odd… it talks entirely about databases, specifically single vs. dual databases.
At some level I suppose you can think of any storage system as a very simple database. Block storage maps a LUN and block offset to a 512 byte block. File storage maps a share and path name to a file. Object storage maps an object name to a data object. How that’s implemented internal to the storage system, however, varies greatly.
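As a rough illustration of that "very simple database" view, the three kinds of storage can be written as three lookup tables. This is only a sketch: the keys, names, and contents below are made up, and real systems implement these mappings very differently internally.

```python
BLOCK_SIZE = 512  # bytes per block, as in the text above

# Block storage: (LUN, block offset) -> fixed-size block
block_store = {("lun0", 0): bytes(BLOCK_SIZE)}

# File storage: (share, path name) -> file contents
file_store = {("share1", "/mail/0001.eml"): b"Subject: hello\n"}

# Object storage: object name -> data object
object_store = {"invoice-0042": b"invoice bytes"}

block = block_store[("lun0", 0)]
message = file_store[("share1", "/mail/0001.eml")]
invoice = object_store["invoice-0042"]
```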
I’ll agree with Bob Woolery that having a single, monolithic database is an absolutely wrong approach to any data storage. Scalability and performance will both suffer as that database becomes a bottleneck to all I/O with the system. But just mirroring your database, as Assureon apparently does, isn’t much better! You’ve simply doubled (or so) the capacity you can scale to before you start running into the same problems.
The only way to cleanly handle scaling is through a fully distributed architecture. There cannot be a single point, or small set of points, through which all data must flow, because that point will always become the system bottleneck. For this reason, Permabit Enterprise Archive has a fully distributed architecture. While every write operation involves multiple nodes in an Enterprise Archive system for redundancy and reliability, none of these nodes is required for all writes. Consecutive operations involve different subsets of nodes in the system. As more nodes are added to the system to increase capacity, the same number of operations uses each node less frequently. Conversely, the larger number of nodes is able to handle a greater number of operations in aggregate, increasing overall system performance as the system scales.
This sort of scalability of performance can only be achieved with a fully distributed system. Any architecture with single (or dual) databases will always slow down as capacity increases.
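The claim that the same number of operations touches each node less often as nodes are added can be checked with a small simulation. The sketch below is an assumption-laden illustration, not Permabit's placement algorithm: it uses rendezvous hashing (a generic, well-known technique I am swapping in) to pick two nodes per write, and the node names, replica count, and operation count are all hypothetical.

```python
import hashlib
from collections import Counter


def pick_nodes(key: str, nodes: list[str], replicas: int) -> list[str]:
    """Rendezvous hashing: rank nodes by hash(node, key) and take the top few."""
    def score(node: str) -> int:
        digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return sorted(nodes, key=score, reverse=True)[:replicas]


def per_node_load(node_count: int, operations: int, replicas: int = 2) -> Counter:
    """Count how many writes each node participates in."""
    nodes = [f"node-{i}" for i in range(node_count)]
    load = Counter()
    for op in range(operations):
        # Each write involves a subset of nodes; no node sees every write.
        load.update(pick_nodes(f"object-{op}", nodes, replicas))
    return load


small = per_node_load(node_count=4, operations=2000)
large = per_node_load(node_count=16, operations=2000)
print("busiest node with 4 nodes: ", max(small.values()))
print("busiest node with 16 nodes:", max(large.values()))
```

The total work is the same in both runs (2000 writes, two nodes each), but with sixteen nodes the busiest node handles far fewer of them. A single or mirrored database has no equivalent: every operation still goes through it no matter how much capacity is added.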
3: Data protection. The existence of the commodity hardware “back door.”
This seems to say that you can’t be assured of your data safety in a system that does not have integral storage, so don’t trust an archive product that acts as a gateway to a SAN. Again, I’m not sure what’s archive specific here. Even with integrated storage a malicious character can corrupt data, unless the drive bays on the storage appliance are wired to 50,000 volts.
There are a great many storage administrators using SAN gateway products for NAS and archive storage, and I’d bet they’d likely dispute this “secret”, but let me provide a far more compelling argument of why you should choose an appliance over a gateway — it’s far more cost effective. Why buy a pricey gateway to use on $30/GB SAN storage, when you can get an integrated Permabit appliance for $5/GB or less?
We have another good reason why Permabit Enterprise Archive is only available as an appliance, not a gateway, and it does have to do with data protection but not any multi-path “back door” concern. Permabit has developed our patent-pending RAIN-EC data protection technology, which is capable of providing data protection up to 250 times more reliable than RAID 6 on equivalent disks. RAIN-EC has advanced coding algorithms for distributing data across multiple disks in multiple nodes, so you’re protected not just against the loss of a drive but also the failure of any component (or multiple components) anywhere in the storage system. This requires precise data distribution in the system, of a sort that is not possible in a gateway architecture.
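For readers unfamiliar with coding-based protection, here is the simplest possible example of the general idea: a single XOR parity chunk. To be clear, this is not RAIN-EC, whose algorithms are not described here; it is a textbook scheme that survives the loss of only one chunk, and the chunk contents and one-chunk-per-node layout are assumptions for illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(chunks: list[bytes]) -> bytes:
    """Return a parity chunk: the XOR of all equal-length data chunks."""
    return reduce(xor_bytes, chunks)


def recover(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the one missing chunk from the survivors plus parity."""
    return reduce(xor_bytes, surviving, parity)


# Hypothetical layout: each chunk, and the parity, lives on a different node.
chunks = [b"AAAA", b"BBBB", b"CCCC"]
parity = encode(chunks)

# Simulate losing the node that held chunks[1].
rebuilt = recover([chunks[0], chunks[2]], parity)
```

Recovery only works if the system knows exactly which chunk sits on which disk in which node, which is why this kind of protection depends on the storage system controlling data placement itself.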
There are plenty of reasons to choose an integrated appliance for archive, but worrying about back-end access to the storage seems like the least of them. Most SAN administrators seem quite content with their SAN security.
4: Data migration. When an archive is moved, the files can become orphaned and the entire process could become exceedingly slow.
Here the complaints become almost nonsensical, focusing on the problems of proprietary APIs rather than anything related to archive at all. It’s a fair complaint that if you’re using a storage system with a proprietary API you have to bring the data back out through the original application to move it, if the original application allows for this at all. (Centera apps are notoriously tricky at this; once you have data in, it may as well be in a roach motel.)
Even if you have a standard API or other interface, however, migration is still an issue. Consider purchasing a NAS-connected archive storage system and loading your petabytes of archive data into it. In three to five years, the vendor is going to come knocking on your door again, offering to sell you the latest and greatest. And also offering to sell you migration services. Even though that device has standard interfaces, migrations are still time consuming, costly, and risk-prone endeavors. I’ve seen many a migration project cost more than the new storage system put in place!
Permabit Enterprise Archive is designed to avoid these sorts of migration headaches. Because of Enterprise Archive’s grid-based architecture, there’s no single critical point in the system. Access nodes and storage nodes are all connected together via standard Ethernet, and individual nodes can be removed and replaced without any system downtime.
New nodes can be of different generations, different capacities, or even different storage technologies. In this way, the system can be organically, piecewise upgraded over time without ever having to go offline. Over 20 years, 50 years or longer you can continually refresh every component, taking advantage of industry improvements in storage density, power efficiency, and performance. Through this whole process you never once have to pay for migration professional services.
5: Energy efficiency. Not the best in most archive systems.
This goes back to the “single database” confusion from the strange secret number 2, so I’m not quite sure what to say here other than to repeat that Enterprise Archive has no central database to act as a bottleneck.
Deduplication is an inherently green technology. If you can get even 2x deduplication out of your archive storage system, that’s half the number of drives that have to be manufactured, purchased and spinning, which means massive energy savings. Depending on the data set we’ve seen savings everywhere from 20% to 300x, all of these reflecting energy efficiency over conventional primary storage.
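The drive-count arithmetic is easy to check. The sketch below uses made-up numbers (1000 TB of logical data on 2 TB drives) and a hypothetical helper name, and it ignores redundancy overhead and spares; it only shows how the deduplication ratio feeds through to the number of spinning drives.

```python
import math


def drives_needed(logical_tb: float, dedup_ratio: float, drive_tb: float) -> int:
    """Drives required to hold the data after deduplication."""
    physical_tb = logical_tb / dedup_ratio
    return math.ceil(physical_tb / drive_tb)


# Hypothetical figures, for illustration only.
baseline = drives_needed(logical_tb=1000, dedup_ratio=1.0, drive_tb=2.0)
deduped = drives_needed(logical_tb=1000, dedup_ratio=2.0, drive_tb=2.0)
print(baseline, deduped)  # 500 250
```

Every drive you don't buy is also a drive you don't power or cool.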
Real “dirty secrets” you should be concerned about
Most of Chris’ storage concerns are valid… but only for a single vendor not actually named in the article. There definitely are dirty secrets that you should be asking your archive storage vendor about, though; consider the following:
Availability. Archive data may be infrequently accessed, but when it is needed it’s needed immediately. Does your archive product include full high-availability (HA) features? Is there any single point of failure that can take the system offline?
Reliability. Archive data may be the last and final copy of critical business information. How reliable is your archive storage system over the long term? Are you using older RAID technologies that might not hold up with modern high capacity drives?
Longevity. Archive data may need to be preserved for 50 years, 100 years, or even indefinitely. How does your storage system provide for long-term archival storage? How do I integrate new technologies? Will I have to migrate to new storage systems every few years to ensure data availability?
Scalability. Data sets for archive storage may be many petabytes in size, and deduplication rates will vary widely. How much disk can your archive storage system address? Don’t sell me a 30 TB box and tell me it stores a petabyte!
Cost. Long-term archive data needs to be stored inexpensively, both today and over the long term. How much is the real cost of storage in your archive storage system, before assuming any deduplication? How much will it cost to maintain and operate? How much will it cost to handle media migrations as components regularly reach the end of their usable lives?
Finally, a challenge. Move beyond “CAS”: I challenge storage industry writers to abolish the term by the end of the year. In the past six months I have not once heard a customer ask for “CAS”. CAS is a technology, not an interface, and not a user feature; only the trade press seems to be keeping this term alive.
Archive storage is a tier of storage and a storage market. XAM is a storage interface for object storage. EMC Centera is a product with a proprietary API. There is no CAS as a market, as a tier, or as an interface. It’s time to kill CAS.