The Patent Truth
Yesterday we announced details on five patents that we were recently awarded. As one of the very first companies to develop technology for data deduplication, we have an extensive portfolio of patent filings. However, the US Patent Office has been so swamped with work that these are only now being issued, many of them eight years after filing. It’s been very exciting to see these finally pop out the other end of the patent system, and we’re looking forward to many more making it through in the coming year.
Patents are written in a strange dialect of English colloquially known as “patentese”, so it’s hard to casually tell what they’re about — this is especially true of patents in high-tech. It’s easy to read a patent that’s for a better mousetrap (even if it’s titled “Mechanism for detecting presence, dispatching and retaining murine pests”), but what is “Storage system for randomly named blocks of data” or “History preservation in a computer storage system” about? Allow me to provide a bit of a secret decoder ring.
Records Retention
We have a bunch of new patents, so I’ll focus on a few of the most interesting ones. The most recent is US Patent No. 7,478,096, “History preservation in a computer storage system”. If you click through you’ll see a lot of confusing wording, but the place to begin is at the start. There you’ll see “a method by which a disk-based distributed data storage system is organized for protecting historical records of stored data entities,” followed by a list of steps and components to make it clearer what we’re talking about.
From very early on, Permabit has included advanced features for records retention in Permabit Enterprise Archive. Things like WORM storage and retention policy management solve critical business problems such as complying with government regulations including SEC rule 17a-4 and FDA 21 CFR Part 11, both of which require data integrity and enforced retention of records. Until recently, such records were retained electronically only on write-once optical disk, or write-once tape. These solutions didn’t provide the accessibility required today to meet litigation discovery requests or to perform data mining operations.
Permabit Enterprise Archive was one of the very first products to allow enforced WORM retention of records on magnetic disk, and we filed key patents on the enabling technologies. This ’096 patent protects our grid-based storage technologies that enforce records retention, be it through a file share or an object interface. Multiple versions of a file can be stored over time, and every one of those versions can be protected and preserved.
Scalable Deduplication
The other patents I’ll discuss today protect some of the technologies we’ve developed that deliver Scalable Data Reduction (SDR), our efficient, in-line deduplication mechanism. The relevant patents here are US Patent No. 7,457,800 and its partner ’813. The description here is a little less transparent: “Storage system for randomly named blocks of data.” What does this have to do with deduplication?
To understand, you need a very quick overview of how hash-based deduplication works. When data is being written into Permabit Enterprise Archive, our SDR technologies break that data up into smaller chunks for deduplication purposes. We then must very quickly identify if any of those chunks have been previously stored. We do this by taking a content fingerprint, or cryptographic hash — today that is the SHA-256 hash function, the integrity of which I discussed in a previous blog entry. Then we look and see if we already have a chunk with the same fingerprint.
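To make that flow concrete, here is a minimal sketch in Python of hash-based deduplication. It is only an illustration, not how SDR is implemented: the fixed 4 KB chunk size, the `DedupStore` class, and the in-memory dictionary are all assumptions made for the example, and the post does not say how SDR actually chunks data. The only piece taken from the text is the use of SHA-256 as the content fingerprint.

```python
import hashlib

# Hypothetical fixed chunk size, chosen only for this example.
CHUNK_SIZE = 4096


def fingerprint(chunk):
    """Return the 32-byte SHA-256 digest that serves as the chunk's name."""
    return hashlib.sha256(chunk).digest()


class DedupStore:
    """Toy in-memory chunk store keyed by content fingerprint (illustrative only)."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes

    def write(self, data):
        """Store data and return the list of fingerprints needed to rebuild it."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            name = fingerprint(chunk)
            # The critical lookup: have we already stored a chunk with this name?
            if name not in self.chunks:
                self.chunks[name] = chunk
            recipe.append(name)
        return recipe

    def read(self, recipe):
        """Rebuild the original data from its list of fingerprints."""
        return b"".join(self.chunks[name] for name in recipe)


store = DedupStore()
data = b"A" * CHUNK_SIZE * 3 + b"B" * CHUNK_SIZE
recipe = store.write(data)
print(len(recipe), "chunks written,", len(store.chunks), "chunks actually stored")
```

The three identical chunks of `A` bytes share one fingerprint, so only one copy is kept. In this toy version the `name not in self.chunks` test is a cheap dictionary lookup; at archive scale that lookup is the hard part.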
It turns out that it’s really hard to do this when you have lots of data, and that’s the fundamental limit to scaling for archive deduplication systems. If you have 100 terabytes of information you might have 10 billion different chunks, each with a 32-byte name. That’s 320 gigabytes just of names! You can’t use an old-fashioned file system to store those and check, as it will take many seconds just to walk through the directory structure… that’s why companies like NetApp have to have background processes that just can’t keep up. Even a traditional database can’t do this quickly because all the names are evenly (randomly) distributed, so you can’t predictively cache part of the list. Think of it like a dictionary: just because you looked up “aardvark” doesn’t make it more likely you’ll look up another word beginning with “A” next.
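The arithmetic behind those numbers is easy to check. The short Python sketch below assumes decimal units (1 terabyte = 10^12 bytes) and uses the post's figure of 10 billion chunks. It also hashes two neighboring dictionary words to illustrate why fingerprints give a cache nothing to predict from; the `name_of` helper is hypothetical and exists only for this example.

```python
import hashlib

TOTAL_BYTES = 100 * 10**12   # 100 terabytes, decimal units (an assumption)
CHUNK_COUNT = 10 * 10**9     # 10 billion chunks, the figure used in the text
NAME_BYTES = 32              # length of a SHA-256 digest

avg_chunk_bytes = TOTAL_BYTES // CHUNK_COUNT
index_bytes = CHUNK_COUNT * NAME_BYTES

print(f"implied average chunk: {avg_chunk_bytes:,} bytes")   # 10,000 bytes
print(f"names alone: {index_bytes // 10**9} GB")              # 320 GB


def name_of(word):
    """Hypothetical helper: hex SHA-256 fingerprint of a word."""
    return hashlib.sha256(word.encode("utf-8")).hexdigest()


# Words that sit side by side in a dictionary get unrelated fingerprints,
# so looking one up says nothing about where the next lookup will land.
for word in ("aardvark", "aardwolf"):
    print(word, name_of(word)[:16])
```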
To solve this problem, companies like Data Domain have used computer science data structures called “Bloom filters”. These make it possible to scale to tens of terabytes, but break down as you go higher. So we developed our own technologies to scale out further: the “delta indexes” described in our patent.
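For readers who haven't met a Bloom filter, here is a minimal textbook sketch in Python. It is a generic illustration of the data structure, not any vendor's implementation, and it is not the delta index from our patent. The `BloomFilter` class, its parameters, and the counter-prefixed SHA-256 hashing scheme are all assumptions chosen for the example.

```python
import hashlib


class BloomFilter:
    """Textbook Bloom filter: a compact bit array answering 'definitely not' or 'maybe'."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, name):
        # Derive num_hashes bit positions by hashing the name with a counter prefix.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "big") + name).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, name):
        for pos in self._positions(name):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, name):
        # False means the name was never added; True means it may have been.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(name))


seen = BloomFilter(num_bits=1024, num_hashes=3)
seen.add(b"fingerprint of an existing chunk")
print(seen.might_contain(b"fingerprint of an existing chunk"))
```

A "no" from the filter is always correct, so a brand-new chunk can be stored without consulting the full index. A "yes" can be a false positive, so it still has to be confirmed against the real index, and for a filter of fixed size the false-positive rate climbs as more names are added.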
This is one way in which we can scale dedupe beyond what other vendors can even dream of today. The other component is our decentralized grid architecture. I’ll explain how that helps SDR in a future post.
The Patent Pipeline
As I said up top, many of these patents were filed in the very early days of the company and are just finally making their way through the patent office now. Well fewer than half of our filings have completed the process to date, so I expect to have more good news to share in the future!