Deduplication and Encryption
I spend a lot of my time talking with Permabit customers, and recently I have been hearing more and more questions about the proper use of encryption in their storage environments. Lost customer data is a huge risk to businesses, and risk often translates directly to cost, either in legal penalties or in cleanup. For example, companies that handle credit card data have been scrambling to comply with the PCI Data Security Standard, and we still hear news of horrors like the theft of 130 million credit card numbers. Encryption is all about obscuring data, but deduplication is about seeing through your data to find and eliminate duplicates. How can these coexist?
They can, but it depends a lot on when and how you encrypt your data. The challenge is in balancing when during your information lifecycle your data is encrypted, and how it is handled. If you encrypt data high up the stack, in the application, then it’s more likely to be end-to-end secure, but cannot be easily shared with other applications in your environment. If you encrypt data lower down the stack, in the storage, it can be easily shared but you must be careful how it is protected in transit. In many environments, different cryptographic implementations are appropriate for different kinds of data.
Broadly, there are two areas in which to consider encryption implementation: transport encryption and data encryption. Transport encryption is where all communication between two applications or servers is encrypted, delivering a secure communications channel. This means that both data and control commands are encrypted; this is important because an eavesdropper can neither see your data nor tell what sort of things you are doing with that data. With transport encryption, the data on the other side of the connection is generally handled in unencrypted form, so this does not protect against data leaks or maliciousness within the application. HTTPS is a common example of transport encryption.
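As a small illustration of transport encryption on the client side, here is a minimal sketch using Python's standard ssl module. The helper name open_secure_channel and the choice of TLS 1.2 as a floor are assumptions made for this example, not anything taken from a particular product.

```python
import socket
import ssl

# Client-side TLS settings from the standard library defaults:
# the server certificate is verified and its hostname is checked.
context = ssl.create_default_context()

# Assumption for this sketch: refuse anything older than TLS 1.2.
context.minimum_version = ssl.TLSVersion.TLSv1_2


def open_secure_channel(hostname: str, port: int = 443) -> ssl.SSLSocket:
    """Hypothetical helper: open a TCP connection and wrap it in TLS.

    Everything sent over the returned socket, data and control commands
    alike, is encrypted in transit. The peer still sees it in the clear.
    """
    raw = socket.create_connection((hostname, port))
    return context.wrap_socket(raw, server_hostname=hostname)
```

Note that this protects only the channel. Once the bytes arrive, the application on the other end handles them unencrypted, which is exactly the limitation described above.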
Data encryption, on the other hand, is where the individual pieces of information being processed are encrypted. These may be processed by the application in an encrypted form, stored into a database, or written to disk. An untrusted application, like perhaps a storage system, can be handed data that has been encrypted without concern that the information will be leaked. An encrypted archive file on disk would be an example here. In some ways you could consider tape encryption as either or both types of security. The tape is carrying data and control information between two applications (or two runtime instances of the same application), so it could be considered a form of “transport” security, but if you consider the third party handling your tapes as part of your storage infrastructure, it’s more like a form of data encryption.
When selecting a product that incorporates encryption, one thing to consider is the encryption algorithm used; luckily, this is an easy choice. Only use products that implement AES, the Advanced Encryption Standard. Anything else is unlikely to be (or remain) secure. The only possible exception would be Triple DES, but it is a cipher that is showing its age.
Recommendations for Deduplication
If you would like to combine deduplication and encryption, at some level the storage stack must have access to the unencrypted data so that it can identify duplicates. For a system with standard protocol interfaces, such as NFS and CIFS, this means not performing data encryption within your application on data that you want to deduplicate. This doesn’t mean you should avoid encryption altogether, however.
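To see why the storage stack needs the unencrypted data, consider this minimal sketch. It fingerprints chunks with SHA-256 and counts how many are unique, first on cleartext and then on the same chunks after application-side encryption. The toy_encrypt function is a toy stand-in for AES (a SHA-256 keystream with a random nonce) so that the example runs with only the standard library; it is an assumption of this sketch and must not be used to protect real data.

```python
import hashlib
import os


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Toy stand-in for AES, for illustration only (NOT secure for real use).

    Like a properly used real cipher, it picks a fresh random nonce per
    message, so encrypting the same plaintext twice gives different output.
    """
    nonce = os.urandom(16)
    stream = bytearray()
    counter = 0
    while len(stream) < len(plaintext):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    body = bytes(p ^ s for p, s in zip(plaintext, stream))
    return nonce + body


def count_unique_chunks(chunks) -> int:
    """Deduplicate by content fingerprint and report how many chunks remain."""
    return len({hashlib.sha256(chunk).digest() for chunk in chunks})


# Ten identical chunks, as you might see across repeated backups.
chunk = b"quarterly report " * 256
plain_chunks = [chunk] * 10

# The same ten chunks, encrypted by the application before storage sees them.
key = os.urandom(32)
encrypted_chunks = [toy_encrypt(key, c) for c in plain_chunks]

unique_plain = count_unique_chunks(plain_chunks)
unique_encrypted = count_unique_chunks(encrypted_chunks)
print(unique_plain, unique_encrypted)
```

On cleartext the ten copies collapse to a single stored chunk. After application-side encryption every copy looks different to the storage system, so nothing deduplicates.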
Data that must be kept very secure and is also unlikely to deduplicate, such as credit card numbers, can be safely encrypted by the application. For protection of the remaining data, you must select a system that both delivers transport encryption between your application and the storage, and also includes data encryption internally to protect data on disk. Surprisingly, there aren’t many options available today.
Several drive vendors have begun to incorporate full-disk encryption (FDE) into their disks, and storage array vendors are just beginning to make use of this. This means that the data on such drives is protected against theft or loss of the drives, but the weakness is still that the device the drives are in must have the keys to unlock them. That means if someone walks away with a server and its disks, all bets are off. FDE drives are compatible with deduplication, though, because any deduplication activities are happening at a higher level in the storage system.
Permabit Enterprise Archive supports encryption at a number of different layers. This includes encryption to the client, encryption during replication, and encryption on disk. Permabit always uses the AES cryptographic algorithm, as mentioned above.
First and foremost, transport encryption is used wherever possible. If the application protocol (e.g. NFS, CIFS) supports an encrypted connection, we will deliver that. Unfortunately, this is not widely available today: CIFS supports secure authentication, but only some recent versions support secure transport.
Replication is always performed over an encrypted channel as well, even if the data being transported is already encrypted. This ensures that customer data is not exposed to an attacker outside your firewall who is surreptitiously trying to intercept your replication data stream. Additionally, because the transport channel is encrypted, an eavesdropper cannot tell anything about the sort of data being replicated, such as file sizes.
Finally, Enterprise Archive can optionally encrypt data on disk so that it is protected against theft or loss of the hardware. This option can be configured on a volume-by-volume basis. If someone were to walk away with one (or even all) of the hard drives in an Enterprise Archive installation, they would not be able to make sense of any of the data. This offers strong protection against data theft from the equipment.
Permabit’s on-disk encryption offers additional protection beyond what full-disk hardware encryption offers, because of how encryption keys are handled. In a system with FDE disks, encryption keys must be stored on the server so as to unlock the disks when they are powered on. While this means that a stolen disk is of no use, a stolen server necessarily contains the keys that will unlock its disks. In Enterprise Archive, data encryption happens at the access node layer, significantly reducing the vulnerability profile of the keys. Because application transports like NFS require cleartext data, the access nodes must have access to the data encryption keys. For encrypted volumes, they encrypt data as it flows to the storage nodes and decrypt it as it flows back. This means that storage nodes never have the keys necessary to decrypt the data that they hold and protect, a significant separation of responsibility.
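The separation of responsibility described above can be sketched in a few lines. This is not Permabit’s implementation; the class names, the keyed fingerprint used to spot duplicates, and the toy cipher (a standard-library stand-in for AES, not secure for real use) are all assumptions made for illustration.

```python
import hashlib
import hmac
import os


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy SHA-256 keystream cipher standing in for AES (illustration only)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + nonce + (offset // 32).to_bytes(8, "big")).digest()
        out += bytes(d ^ b for d, b in zip(data[offset:offset + 32], block))
    return bytes(out)


class StorageNode:
    """Hypothetical storage node: holds opaque blocks and never sees a key."""

    def __init__(self):
        self.blocks = {}

    def put(self, fingerprint: bytes, ciphertext: bytes) -> None:
        # Keep the first copy; a later write with the same fingerprint is a duplicate.
        self.blocks.setdefault(fingerprint, ciphertext)

    def get(self, fingerprint: bytes) -> bytes:
        return self.blocks[fingerprint]


class AccessNode:
    """Hypothetical access node: sees cleartext from clients and holds the keys."""

    def __init__(self, storage: StorageNode):
        self._enc_key = os.urandom(32)  # data encryption key
        self._fp_key = os.urandom(32)   # key for fingerprints (an assumption of this sketch)
        self._storage = storage

    def write(self, cleartext: bytes) -> bytes:
        # Fingerprint the cleartext so that duplicates can still be identified.
        fingerprint = hmac.new(self._fp_key, cleartext, hashlib.sha256).digest()
        nonce = os.urandom(16)
        ciphertext = nonce + _keystream_xor(self._enc_key, nonce, cleartext)
        self._storage.put(fingerprint, ciphertext)
        return fingerprint

    def read(self, fingerprint: bytes) -> bytes:
        blob = self._storage.get(fingerprint)
        return _keystream_xor(self._enc_key, blob[:16], blob[16:])


storage = StorageNode()
access = AccessNode(storage)
data = b"archived email attachment" * 100
ref_a = access.write(data)
ref_b = access.write(data)  # duplicate write: nothing new is stored
```

In this sketch, someone who walks away with the storage node gets only ciphertext and fingerprints, while the access node remains the single place where keys and cleartext meet.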
Overall, deduplication and encryption are compatible, but to use them together you must take care where you apply the encryption. For data to be deduplicated, encryption must take place within the storage system, and not at an application or gateway layer. To ensure data security, make sure information is encrypted in flight whenever possible, especially during replication or backup.