Protecting Against Data Rot
Last week one of my favorite technology journalists, David Pogue at the New York Times, wrote about the problem of data rot. (I wish my web videos were as funny as his.) He spoke with Dag Spicer of the Computer History Museum about the challenges they face trying to restore data from old media.
I found it interesting, but not surprising, that they have better luck reading very old media than newer failing media. While this might seem counterintuitive at first, I think it makes a lot of sense. Old data storage mechanisms were very simple to understand once they were invented and built. I remember that as recently as ten years ago disk drives didn’t do any sort of automatic remapping of defective blocks; instead, the drive came with a label identifying the blocks that tested bad, and you had to enter that list into your computer so it knew to avoid them. Today, that information is encoded on the drive itself, and much more sophisticated software would be required to reconstruct data from just the raw media.
Similarly, the magnetic domains on today’s storage are hundreds of thousands of times smaller than those of 20 to 30 years ago. With today’s read technology we can recover data from media that has suffered significant damage, but there’s just so much less damage modern media can sustain and still be readable.
The Computer History Museum’s problems are largely related to abandoned media that has been unmaintained and ignored, though, and Spicer doesn’t give hope to modern users who are still writing and preserving data. For users today the situation is not nearly so dire! Media migration and data preservation can be automated, as with the future-proof media migration built into every Permabit Enterprise Archive system.
I last wrote about this in the first article in my data preservation series, No Silver Bullet: Archive Challenges. Organizations like the National Archives and Records Administration (NARA) recommend copying archive data to new, modern media every three to five years; this addresses both the danger of media degradation, through media refresh, and the danger of media obsolescence, through technology refresh. With large archive data sets this means that media refresh is occurring on a nearly continuous basis: there’s always some data on media reaching the end of its lifecycle.
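To make the refresh arithmetic concrete, here is a minimal Python sketch of checking which volumes are due for a copy. The four-year interval is my own assumed midpoint of the three-to-five-year guidance, and the inventory, volume labels, and the refresh_due function are hypothetical names made up for illustration, not part of any real tool.

```python
from datetime import date

REFRESH_YEARS = 4  # assumption: midpoint of the three-to-five-year guidance


def refresh_due(written: date, today: date, years: int = REFRESH_YEARS) -> bool:
    """Return True once a volume written on `written` is at least `years` old."""
    try:
        deadline = written.replace(year=written.year + years)
    except ValueError:
        # Written on Feb 29 and the target year is not a leap year.
        deadline = written.replace(year=written.year + years, day=28)
    return today >= deadline


# Hypothetical inventory: volume label -> date the media was written.
inventory = {
    "vol-a": date(2005, 1, 15),
    "vol-b": date(2006, 6, 1),
    "vol-c": date(2008, 3, 10),
}

today = date(2009, 4, 1)
due = sorted(label for label, written in inventory.items()
             if refresh_due(written, today))
print(due)  # only vol-a has passed its four-year mark
```

With write dates staggered like this, a different volume crosses the threshold every year or so, which is why a large archive ends up refreshing media more or less all the time.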
Addressing this manually is a time-consuming, error-prone and painful process, as discussed in Pogue’s interview. That’s why we’ve automated it in our systems. New storage nodes can be added at any time with the latest and greatest technology, and older nodes can be removed as they reach the end of their life. All data movement is handled automatically, and data is never at risk during any of these internal migrations. This ensures that, regardless of how many petabytes of data you have, you’ll never run into the bit rot problem with a Permabit Enterprise Archive.
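The general idea behind keeping data safe during a migration can be sketched in a few lines of Python: copy, verify, and only then retire the old copy. This is an illustration of the principle under my own assumptions, not a description of how any product implements it; the dictionaries standing in for storage nodes and the migrate and digest names are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 fingerprint used to confirm a copy matches its source."""
    return hashlib.sha256(data).hexdigest()


def migrate(old_node: dict, new_node: dict) -> list:
    """Copy every object to new_node, deleting the source only after the
    copy has been verified. Dicts stand in for storage nodes here."""
    moved = []
    for key in list(old_node):
        data = old_node[key]
        expected = digest(data)
        new_node[key] = bytes(data)           # write the new copy
        if digest(new_node[key]) != expected:
            del new_node[key]                 # discard the bad copy, keep the source
            raise IOError(f"verification failed for {key}")
        del old_node[key]                     # source removed only after verification
        moved.append(key)
    return moved


old = {"report.txt": b"quarterly numbers", "scan.tif": b"\x00\x01\x02"}
new = {}
moved = migrate(old, new)
print(sorted(moved), len(old))
```

The ordering is the point: at every moment at least one verified copy of each object exists, so an interrupted migration leaves nothing lost.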
An entirely separate issue that Pogue doesn’t talk about in his interview with Spicer is the problem of logical readability. Just because you can get the raw data back doesn’t mean you can make sense of it. You may not have an application that can read it anymore! This would be like playing one of Edison’s robust wax cylinders only to find that it contained a message in a dead language! More on how to solve that in No Silver Bullet: Logical Readability.