<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Permabit &#187; Jered Floyd, CTO</title>
	<atom:link href="http://permabit.com/category/blog/jered-floyd/feed/" rel="self" type="application/rss+xml" />
	<link>http://permabit.com</link>
	<description>OEM Data Optimization Solutions</description>
	<lastBuildDate>Tue, 14 May 2013 16:51:16 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=</generator>
		<item>
		<title>The Performance Challenges of Data Optimization &#8211; Data Deduplication</title>
		<link>http://permabit.com/media-center/blogs/2010/12/data-deduplication/</link>
		<comments>http://permabit.com/media-center/blogs/2010/12/data-deduplication/#comments</comments>
		<pubDate>Fri, 03 Dec 2010 15:53:38 +0000</pubDate>
		<dc:creator>Jered Floyd</dc:creator>
				<category><![CDATA[Jered Floyd, CTO]]></category>
		<category><![CDATA[Albireo]]></category>
		<category><![CDATA[dedupe]]></category>
		<category><![CDATA[deduplication]]></category>
		<category><![CDATA[primary data optimization]]></category>
		<category><![CDATA[primary dedupe]]></category>
		<category><![CDATA[primary storage]]></category>
		<category><![CDATA[primary storage deduplication]]></category>

		<guid isPermaLink="false">http://blog.permabit.com/?p=1135</guid>
		<description><![CDATA[Data deduplication is like the big brother to lossless compression; it also depends on identifying redundant stretches of data, but does it across a much larger pool of data. While compression identifies duplicates of a few bytes within a file, deduplication locates much larger duplicate chunks, perhaps 4 KB or more, across the entire pool...]]></description>
				<content:encoded><![CDATA[<p><a  title="Data Deduplication" href="http://www.permabit.com/products/data-deduplication.asp">Data deduplication</a> is like the big brother to lossless compression; it also depends on identifying redundant stretches of data, but does it across a much larger pool of data.  While compression identifies duplicates of a few bytes within a file, <a  title="Deduplication" href="http://www.permabit.com/products/data-deduplication.asp">deduplication</a> locates much larger duplicate chunks, perhaps 4 KB or more, across the entire pool of storage, thus working across files, or even file systems and LUNs.  This provides the opportunity for much greater savings, but also introduces significant technical challenges.<span id="more-1135"></span></p>
<p>As I described in my last post, a data compression routine might store in memory all the data within the last 64 KB it has processed as candidates for duplicate elimination.  Since the candidates in deduplication are all the other chunks stored, there isn&#8217;t remotely enough memory for this to be done!  Instead, data deduplication uses a smaller content fingerprint, a cryptographic hash, to identify blocks.  This fingerprint is only 32 bytes in size, but for all practical purposes it will be unique across all of the chunks in the storage system, since the odds of two different chunks sharing a fingerprint are vanishingly small.</p>
<p>As a new chunk comes into the storage system, <a  title="Data Deduplication" href="http://www.permabit.com/products/data-deduplication.asp">data deduplication</a> takes the fingerprint of that chunk and compares it against a table of all the other fingerprints of chunks in the storage system.  If the new fingerprint matches, the new chunk is a duplicate and the data need not be stored.  Across a large storage system with much repeated data, savings of up to 100x, or even higher, are possible, and the unique chunks can also be traditionally compressed for further savings.</p>
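<p>The lookup described above can be sketched in a few lines of Python.  This is a toy illustration only, not how any shipping product works: it assumes SHA-256 as the 32-byte fingerprint, and the <code>DedupeStore</code> class, its in-memory dictionary, and the list standing in for disk are all hypothetical names chosen for the example.</p>
<pre><code class="language-python">import hashlib

CHUNK_SIZE = 4096  # 4 KB chunks, as in the text


class DedupeStore:
    """Toy in-memory store that keeps one copy per unique chunk."""

    def __init__(self):
        self.index = {}   # fingerprint -> location of the stored chunk
        self.chunks = []  # stand-in for physical storage

    def write(self, chunk: bytes) -> int:
        """Store a chunk and return its location, reusing any duplicate."""
        fingerprint = hashlib.sha256(chunk).digest()  # 32-byte fingerprint
        location = self.index.get(fingerprint)
        if location is None:
            # New content: store it and remember where it went.
            location = len(self.chunks)
            self.chunks.append(chunk)
            self.index[fingerprint] = location
        return location


store = DedupeStore()
a = store.write(b"A" * CHUNK_SIZE)
b = store.write(b"B" * CHUNK_SIZE)
c = store.write(b"A" * CHUNK_SIZE)  # duplicate of the first chunk
print(a, b, c, len(store.chunks))   # 0 1 0 2
</code></pre>
<p>Three chunks are written but only two are stored; the third write simply returns the location of the first.</p>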
<p>The level of optimization possible with <a  title="Deduplication" href="http://www.permabit.com/products/data-deduplication.asp">deduplication</a> is enormous, but unfortunately the complexity of the implementation is equally huge.  The data deduplication routine must maintain a table of fingerprints, as well as the locations of the corresponding chunks.  In the simplest implementation, this requires around 64 bytes per entry.  With a 4 KB chunk size, this table would require 16 GB of RAM per TB of disk! That&#8217;s not a realistic expectation for any storage system, but thankfully improvements can be made.</p>
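<p>The arithmetic behind that figure is easy to check.  The short Python sketch below uses the numbers from the paragraph above (64 bytes per entry, 4 KB chunks) and treats a terabyte as 2<sup>40</sup> bytes; the function name <code>index_ram_bytes</code> is made up for the example.</p>
<pre><code class="language-python">KB = 2 ** 10
GB = 2 ** 30
TB = 2 ** 40


def index_ram_bytes(disk_bytes, chunk_bytes=4 * KB, entry_bytes=64):
    """RAM needed for a naive index holding one entry per chunk."""
    entries = disk_bytes // chunk_bytes
    return entries * entry_bytes


print(index_ram_bytes(TB) // GB)  # 16
</code></pre>
<p>One terabyte holds about 268 million 4 KB chunks, and 64 bytes for each of them adds up to 16 GB.</p>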
<p>The core part of any deduplicating system is this index mapping from content fingerprint to storage location, and it&#8217;s clear that this can&#8217;t be kept entirely in memory.  The first instinct is to put it on disk instead, but that would mean performing disk operations for every new block written to determine if it is a duplicate.  Even if this were kept down to a single seek, performance would be degraded by 50 to 80 percent; not an acceptable solution.</p>
<p>Caching portions of this index might speed this up, but unfortunately the nature of a cryptographic hash means that the fingerprints by which new blocks must be looked up are randomly distributed, so just because the last fingerprint began with &#8220;5&#8221; there&#8217;s very little chance the next will as well.  Caching by fingerprint ranges simply does not work, and so much more complex technologies are required.</p>
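<p>A quick experiment makes the lack of locality visible.  The sketch below assumes SHA-256 as the fingerprint and uses a made-up workload of nearly identical, sequentially written blocks; it measures how often two consecutive fingerprints begin with the same hexadecimal digit.</p>
<pre><code class="language-python">import hashlib


def fingerprint(chunk: bytes) -> str:
    """Hex form of the chunk fingerprint (SHA-256 assumed for illustration)."""
    return hashlib.sha256(chunk).hexdigest()


# Hypothetical workload: blocks written one after another, differing only slightly.
fingerprints = [fingerprint(b"block-%d" % i) for i in range(10000)]

# Count how often the next fingerprint starts with the same hex digit as the last.
pairs = zip(fingerprints, fingerprints[1:])
same = sum(1 for prev, cur in pairs if prev[0] == cur[0])
ratio = same / (len(fingerprints) - 1)
print(ratio)  # expect roughly 1/16: no better than chance
</code></pre>
<p>Even though the inputs are almost identical and arrive in order, the leading digit of one fingerprint says nothing about the next, so a cache keyed on fingerprint ranges would hit no more often than chance.</p>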
<p>In upcoming posts I will describe in more detail a number of the techniques that Permabit has developed and patented that allow us to overcome the challenges in building a scalable, high performance index for <a  title="Data Deduplication" href="http://www.permabit.com/products/data-deduplication.asp">data deduplication</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://permabit.com/media-center/blogs/2010/12/data-deduplication/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Performance Challenges of Data Optimization &#8211; Data Compression</title>
		<link>http://permabit.com/media-center/blogs/2010/11/compression/</link>
		<comments>http://permabit.com/media-center/blogs/2010/11/compression/#comments</comments>
		<pubDate>Fri, 19 Nov 2010 16:05:57 +0000</pubDate>
		<dc:creator>Jered Floyd</dc:creator>
				<category><![CDATA[Jered Floyd, CTO]]></category>
		<category><![CDATA[Albireo]]></category>
		<category><![CDATA[compression]]></category>
		<category><![CDATA[dedupe]]></category>
		<category><![CDATA[enterprise storage]]></category>
		<category><![CDATA[primary storage]]></category>
		<category><![CDATA[primary storage deduplication]]></category>

		<guid isPermaLink="false">http://blog.permabit.com/?p=1122</guid>
		<description><![CDATA[Unlike thin provisioning, compression allows for space savings on storage actually in use for data, and different variants can operate at both block and file storage levels.  Compression technologies fall broadly into two categories: lossless and lossy.  Lossless compression works by identifying redundancies within a short stretch of data, such as a file or block,...]]></description>
				<content:encoded><![CDATA[<p>Unlike thin provisioning, compression allows for space savings on storage actually in use for data, and different variants can operate at both block and file storage levels.  Compression technologies fall broadly into two categories: lossless and lossy.  Lossless compression works by identifying redundancies within a short stretch of data, such as a file or block, and eliminating those duplicate parts.  As the name implies, all of the original data is always recovered when the data is read.<span id="more-1122"></span></p>
<p>In contrast, lossy compression degrades a file by identifying portions of data that don&#8217;t need to be kept because they have minimal impact on the presentation of that data; as such, this is limited to human-consumed media formats.  For example, the human eye senses colors with lower resolution than it senses brightness, so nearly all image compression starts by throwing out three-quarters of the color information. On the other hand, you wouldn&#8217;t want your spreadsheet rounding all values to the nearest whole number just to save space! Because of this, lossy compression is only useful at the level of an application creating or managing a data file, since only at that level can it know what degree of content degradation is acceptable.</p>
<p>At the level of the storage only lossless compression is appropriate, and all the applicable technologies work in similar ways.  Compression algorithms look at a relatively small portion of data, identify short runs of duplicate data even a few bytes in length, and replace these with shorter references to a table of redundancies.  In the process they identify other inefficiencies, such as bits not being used in that stretch of data. This shorter form, often two to four times smaller, is then written.</p>
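<p>For a concrete feel, here is a minimal Python sketch using the standard <code>zlib</code> module, one widely available lossless compressor (chosen for the example, not necessarily what any storage product uses).  The sample data is deliberately repetitive, so it shrinks far more than the two to four times typical of real data.</p>
<pre><code class="language-python">import zlib

# Highly repetitive sample data (artificial, for illustration only).
block = b"customer_id,order_id,status\n" * 128
packed = zlib.compress(block)

print(len(block), len(packed))           # the compressed form is much smaller
assert zlib.decompress(packed) == block  # lossless: every byte comes back
</code></pre>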
<p>Because compression operates within a short stretch of data the memory requirements are generally quite small. On the other hand, many of the more complex algorithms require a lot of computational horsepower, especially during write.  Without specialized hardware, this need can create a potential performance bottleneck.  Additionally, compression is a challenge for block-based storage because block storage operates on fixed-sized blocks, and cannot easily manage the smaller blocks created by a compression technology.</p>
<p>In my next post I&#8217;ll discuss data deduplication, which in many ways is like the &#8220;big brother&#8221; to data compression in that it also eliminates redundant stretches of data.  Because data deduplication operates at a much larger scope, however, the underlying technologies are completely different from compression and the opportunities for space savings are far greater.  As I will discuss, both technologies can be combined for even more effective data optimization.</p>
]]></content:encoded>
			<wfw:commentRss>http://permabit.com/media-center/blogs/2010/11/compression/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Performance Challenges of Storage Optimization &#8211; Thin Provisioning</title>
		<link>http://permabit.com/media-center/blogs/2010/11/thinprovisioning/</link>
		<comments>http://permabit.com/media-center/blogs/2010/11/thinprovisioning/#comments</comments>
		<pubDate>Wed, 10 Nov 2010 20:53:35 +0000</pubDate>
		<dc:creator>Jered Floyd</dc:creator>
				<category><![CDATA[Jered Floyd, CTO]]></category>
		<category><![CDATA[dedupe]]></category>
		<category><![CDATA[deduplication]]></category>
		<category><![CDATA[primary dedupe]]></category>
		<category><![CDATA[primary storage]]></category>
		<category><![CDATA[primary storage deduplication]]></category>
		<category><![CDATA[Thin Provisioning]]></category>

		<guid isPermaLink="false">http://blog.permabit.com/?p=1096</guid>
		<description><![CDATA[In my last post, I identified the three biggest challenges to implementing primary storage optimization &#8211; avoiding performance degradation, supplementing all existing functionality, and preventing data lock-in.  Over the next few posts I will go into more detail on the first, performance. Any storage optimization method is based on finding and eliminating data that doesn&#8217;t...]]></description>
				<content:encoded><![CDATA[<p>In my last post, I identified the three biggest challenges to implementing primary storage optimization &#8211; avoiding performance degradation, supplementing all existing functionality, and preventing data lock-in.  Over the next few posts I will go into more detail on the first, performance.</p>
<p>Any storage optimization method is based on finding and eliminating data that doesn&#8217;t need to be stored.  This can take many forms, and the different forms provide different levels of savings and have different levels of complexity of implementation.  I&#8217;ll start with the simplest methods: thin provisioning and zero elimination.<span id="more-1096"></span></p>
<p>Thin provisioning and zero elimination are similar storage optimization technologies that can provide significant storage efficiencies for block (SAN) storage.  Thin provisioning optimizes storage by not physically allocating parts of LUNs that have not yet been written to by an application; since data hasn&#8217;t been written yet, applications shouldn&#8217;t be expecting anything important should they try to read these parts of disks, and returning zeroes is just fine.  Similarly, some storage technologies now implement &#8220;zero elimination&#8221;, by which they recognize when an application writes a very large number of zeroes (to clear out storage no longer being used, for example) and deallocate these parts of the underlying disk, thus freeing up space.</p>
<p>Both of these mechanisms only provide savings for storage not actually being used.  They act as free space optimization rather than data optimization, and help work around the fact that in block storage it&#8217;s very inconvenient to change the storage device size after it has been created.  These techniques allow large LUNs to be allocated for ease of future use, but don&#8217;t force the immediate cost of backing these large LUNs with disk space consumption.</p>
<p>The complexity of implementation is relatively low for thin provisioning.  The underlying storage device must include a table that can map requests from logical addresses (i.e. LUN and LBA) to a physical address.  This is usually done on consecutive parts of the disk (&#8220;extents&#8221; or &#8220;pages&#8221;) of 1 MB or larger because new portions of disk are most often written to linearly. An application writing new data will generally write one block after another, instead of writing one block, then skipping a megabyte of empty disk, writing another block, and so on.  Assuming 16 bytes are necessary for each entry in this table, only 16 megabytes of RAM are necessary per terabyte of disk actually in use.  No additional memory is required to perform space savings, since this table provides the information of whether or not a disk portion has yet been allocated, and performance is very high since this table lookup is very quick.</p>
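<p>The mapping table can be sketched as a dictionary from extent number to backing storage.  The Python below is a simplified illustration under stated assumptions: the <code>ThinVolume</code> class is hypothetical, extents are 1 MB, and a write is assumed never to cross an extent boundary.</p>
<pre><code class="language-python">EXTENT_SIZE = 1024 * 1024  # 1 MB extents, as in the text


class ThinVolume:
    """Toy thin-provisioned LUN: extents are allocated on first write."""

    def __init__(self):
        self.extent_map = {}  # logical extent number -> backing bytearray

    def write(self, offset: int, data: bytes) -> None:
        # Simplification: the write must fit inside a single extent.
        extent_no, start = divmod(offset, EXTENT_SIZE)
        if extent_no not in self.extent_map:
            self.extent_map[extent_no] = bytearray(EXTENT_SIZE)  # allocate now
        self.extent_map[extent_no][start:start + len(data)] = data

    def read(self, offset: int, length: int) -> bytes:
        extent_no, start = divmod(offset, EXTENT_SIZE)
        extent = self.extent_map.get(extent_no)
        if extent is None:
            return bytes(length)  # never written: return zeroes
        return bytes(extent[start:start + length])


lun = ThinVolume()
lun.write(5 * EXTENT_SIZE, b"hello")
print(lun.read(5 * EXTENT_SIZE, 5))    # b'hello'
print(lun.read(900 * EXTENT_SIZE, 4))  # b'\x00\x00\x00\x00'
print(len(lun.extent_map))             # 1
</code></pre>
<p>Only the one extent that was written consumes space, and the same table answers both &#8220;where is this data?&#8221; and &#8220;has this region been allocated?&#8221;.  At 16 bytes per entry, a full terabyte of 1 MB extents needs about a million entries, which is where the 16 megabytes comes from.</p>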
<p>Zero elimination is essentially the thin provisioning process in reverse.  When the storage system recognizes that an allocated extent has been filled entirely with zeroes, the underlying disk is deallocated so it is available for future use, and the entry is removed from the allocation table.  The system will generally scan for these runs of zeroes as a background process so there is some disk overhead, but this can frequently be built into other disk scrubbing operations.</p>
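<p>As a rough sketch of that background scan, the Python below walks a hypothetical allocation table (a dictionary from extent number to its contents, with 1 MB extents assumed) and removes every extent that holds nothing but zeroes.  A real system would do this incrementally against disk; the names here are invented for the example.</p>
<pre><code class="language-python">EXTENT_SIZE = 1024 * 1024  # 1 MB extents (assumed)


def reclaim_zero_extents(extent_map: dict) -> int:
    """Drop every allocated extent that contains only zero bytes."""
    zero_extents = [no for no, data in extent_map.items() if not any(data)]
    for no in zero_extents:
        del extent_map[no]  # deallocate; later reads fall back to zeroes
    return len(zero_extents)


extent_map = {
    0: bytearray(b"data") + bytearray(EXTENT_SIZE - 4),  # still in use
    1: bytearray(EXTENT_SIZE),                           # zeroed by the application
}
print(reclaim_zero_extents(extent_map))  # 1
print(sorted(extent_map))                # [0]
</code></pre>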
<p>Given the relative ease with which thin provisioning and zero elimination can be implemented, and the low memory and computational requirements, it&#8217;s clear why nearly every modern block storage architecture includes these technologies.  Since these techniques only serve to optimize space that is not actually in use, however, the amount of savings delivered is limited.</p>
<p>In upcoming posts I&#8217;ll talk about further techniques that perform data optimization on primary storage.</p>
]]></content:encoded>
			<wfw:commentRss>http://permabit.com/media-center/blogs/2010/11/thinprovisioning/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Avoiding Obstacles to Primary Storage Optimization</title>
		<link>http://permabit.com/media-center/blogs/2010/11/avoiding-obstacles-to-primary-storage-optimization/</link>
		<comments>http://permabit.com/media-center/blogs/2010/11/avoiding-obstacles-to-primary-storage-optimization/#comments</comments>
		<pubDate>Tue, 02 Nov 2010 19:34:02 +0000</pubDate>
		<dc:creator>Jered Floyd</dc:creator>
				<category><![CDATA[Jered Floyd, CTO]]></category>
		<category><![CDATA[dedupe]]></category>
		<category><![CDATA[deduplication]]></category>
		<category><![CDATA[primary dedupe]]></category>
		<category><![CDATA[primary storage]]></category>
		<category><![CDATA[primary storage deduplication]]></category>

		<guid isPermaLink="false">http://blog.permabit.com/?p=1062</guid>
		<description><![CDATA[Introduction to The CTO Series Primary storage optimization consistently ranks as one of the top interests for storage purchasers today, and it&#8217;s no wonder why; this term encompasses a variety of technologies that can greatly reduce the effective cost of primary storage on magnetic disk or SSD. These technologies range in effectiveness from thin provisioning...]]></description>
				<content:encoded><![CDATA[<p><em><strong>Introduction to The CTO  Series</strong></em></p>
<p>Primary <a  title="Storage Optimization" href="http://www.permabit.com/solutions/backup-optimization.asp">storage optimization</a> consistently ranks as one of the top interests for storage purchasers today, and it&#8217;s no wonder why; this term encompasses a variety of technologies that can greatly reduce the effective cost of primary storage on magnetic disk or SSD.  These technologies range in effectiveness from thin provisioning and zero elimination to compression and <a  title="Deduplication" href="http://www.permabit.com/products/data-deduplication.asp">deduplication</a>, but all involve making more space usable for data storage on some amount of physical storage with a high and only slowly decreasing cost.  The more data that can be put into that physical storage, the lower the effective cost per gigabyte becomes.<span id="more-1062"></span></p>
<p>While technologies that prevent the complete waste of storage, such as thin provisioning, are now commonplace, primary storage vendors have been slow to build in components such as <a  title="Deduplication" href="http://www.permabit.com/products/data-deduplication.asp">deduplication</a> which can drive much greater savings by <a title="Storage Optimization" href="http://www.permabit.com/solutions/backup-optimization.asp">optimizing stored data</a>. NetApp has been successful with their relatively limited deduplication functionality; our recent announcements with BlueArc and Xiotech deliver the first credible challenges to that, delivering both superior deduplication and performance.  <a  title="Deduplication" href="http://www.permabit.com/products/data-deduplication.asp">Deduplication</a> is a complex technology, and there are significant obstacles to delivering it in primary storage.</p>
<p><em>Avoiding Performance Degradation</em></p>
<p>First among these is performance: delivering deduplication at a rate fast enough for tier 1 primary storage while covering enough data to provide significant savings.  Together, these two performance aspects are why deduplication has been mostly limited to backup and archive applications.  In disk-to-disk backup applications there are long periods of inactivity during which computationally intensive deduplication can be performed on stored data, and because of the high rate of savings even the most popular backup dedupe solutions only need to scale to 30 to 50 TB of disk.  In contrast, primary storage environments may have hundreds of terabytes of unique data and there are no guaranteed quiet times for catching up using post-process deduplication.</p>
<p>To deliver dedupe with the performance necessary for primary storage, Permabit leveraged our more than 10 years of deduplication experience.  With our Enterprise Archive solution we developed and patented data structures and processing methods that allow us to identify duplicate chunks of data in less than 10 microseconds, on average, and to scale to multiple GB per second, meeting the performance requirements for even the most demanding inline deduplication of primary data streams. Our <a  title="Albireo - Data Optimization Software" href="http://www.permabit.com/albireo/albireo-overview.asp">Albireo data optimization software</a> is able to identify these duplicates across multiple petabytes of unique data, rather than the mere 16 TB of our closest competitor.</p>
<p><em>Supplementing All Existing Functionality</em></p>
<p>The next obstacle to avoid in delivering <a  title="Storage Optimization" href="http://www.permabit.com/solutions/backup-optimization.asp">storage optimization</a> is the masking of core storage functionality.  Approaches to optimization, such as those seen with Ocarina, Storwize and Exar, often involve a filter layer that sits between the storage consumer and the storage device.  Beyond the problem of potentially acting as a bottleneck to performance, this filter approach means that the back-end storage is treated as &#8220;dumb&#8221; disk and the storage client only sees the interfaces the intermediate layer is able to present.  If the intermediate optimization layer doesn&#8217;t support snapshots, then the client doesn&#8217;t get snapshots.  If it doesn&#8217;t support record-level retention, the client doesn&#8217;t get retention, and so forth for any other differentiating storage functionality.</p>
<p>To avoid this problem, <a  title="Permabit Albireo" href="http://www.permabit.com/albireo/albireo-overview.asp">Permabit Albireo</a> is delivered as a software library for our storage partners to integrate into their primary storage solutions.  Albireo is not deployed as a separate appliance or software layer, but rather works directly with our partners&#8217; existing block and file systems.  This requires a development effort on the part of our partners, but the work involved is light relative to other product features and results in a deeply integrated solution.  Because our partners use <a  title="Albireo - Data Optimization Software" href="http://www.permabit.com/albireo/albireo-overview.asp">Albireo</a> as a duplicate identification service inside their existing stack, deduplication is provided across all other existing storage functionality.</p>
<p><em>Preventing Data Lock-In</em></p>
<p>Finally, layered approaches at the appliance or software level introduce the serious risk of data lock-in.  When a deduplication solution, such as the ones mentioned above, acts as a filter layer on top of a block or file system it must reformat the data being written to disk storage.  The file system features are completely reimplemented at the filter layer, leaving the data in a proprietary format on the back-end storage.  This data then requires rehydration through the filter layer in order to be retrieved.  A storage client attempting to read the storage directly will be unable to interpret the data, and any future migration will require reading back through the filter layer and rehydrating all of the data stored.</p>
<p>This data lock-in is a tremendous risk.  If the filter layer ever breaks or goes away, all customer data becomes inaccessible.  Even backups will be unusable unless all data is always rehydrated prior to backup, a slow and wasteful process.</p>
<p>Albireo addresses this problem too by integrating tightly with the existing storage stack.   Albireo acts as a duplicate advisory service, identifying duplicate data across a large storage pool as it is written, but then the underlying block or extent unification is recorded within the storage vendor&#8217;s file or block metadata.  There is no rehydration required on read or backup.  If Albireo were removed from the storage system all customer data would remain accessible, just no additional deduplication would occur.  This behavior makes Albireo a much safer option than a layered approach to deduplication.</p>
<p>Primary <a  title="Storage Optimization" href="http://www.permabit.com/solutions/backup-optimization.asp">storage optimization</a> through <a  title="Deduplication" href="http://www.permabit.com/products/data-deduplication.asp">deduplication</a> is a critical feature that will become a requirement for primary storage purchases over the next twelve months, but there are several obstacles that make providing this feature a challenge for storage vendors. Storage vendors cannot risk their existing performance and functionality when delivering dedupe, and they must avoid dangerous data lock-in to a solution they do not control.  BlueArc and Xiotech have been the first to announce conquering these challenges by integrating our <a  title="Albireo - Data Optimization Software" href="http://www.permabit.com/albireo/albireo-overview.asp">Albireo</a> high-performance <a  title="Deduplication" href="http://www.permabit.com/products/data-deduplication.asp">deduplication</a> engine and in doing so have raised the bar for the rest of the industry.</p>
]]></content:encoded>
			<wfw:commentRss>http://permabit.com/media-center/blogs/2010/11/avoiding-obstacles-to-primary-storage-optimization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Compression and Dedupe: Redux</title>
		<link>http://permabit.com/media-center/blogs/2010/06/compression-and-dedupe-redux/</link>
		<comments>http://permabit.com/media-center/blogs/2010/06/compression-and-dedupe-redux/#comments</comments>
		<pubDate>Tue, 22 Jun 2010 21:27:28 +0000</pubDate>
		<dc:creator>Jered Floyd</dc:creator>
				<category><![CDATA[Jered Floyd, CTO]]></category>
		<category><![CDATA[CTO]]></category>
		<category><![CDATA[dedupe]]></category>
		<category><![CDATA[deduplication]]></category>
		<category><![CDATA[primary storage]]></category>

		<guid isPermaLink="false">http://blog.permabit.com/?p=867</guid>
		<description><![CDATA[Yesterday The Storage Alchemist at Storwize posted a complaint about Tom&#8217;s discussion of compression and deduplication. We certainly aren&#8217;t savaging compression technologies &#8212; I think perhaps it&#8217;s clearer to consider our points not so much as a criticism of compression, but as a list of concerns regarding bump-in-the-wire optimization appliances. We absolutely agree with Steve...]]></description>
				<content:encoded><![CDATA[<p>Yesterday The Storage Alchemist at Storwize <a  href="http://www.thestoragealchemist.com/marketing-fud-and-doing-what-you-do-best/">posted a complaint</a> about Tom&#8217;s <a  href="http://permabit.com/index.php/2010/06/compression-and-dedupe-business-value-and-data-safety/">discussion of compression and deduplication</a>. We certainly aren&#8217;t savaging compression technologies &#8212; I think perhaps it&#8217;s clearer to consider our points not so much as a criticism of compression, but as a list of concerns regarding bump-in-the-wire optimization appliances. We absolutely agree with Steve the Alchemist that data compression and data deduplication are two technologies that complement one another well &#8212; we use both in our <a  href="http://www.permabit.com/products/data-center-series.asp">Enterprise Archive Value NAS</a> and <a  href="http://www.permabit.com/products/cloud-storage.asp">Cloud Storage</a> offerings , and we make it possible for our partners to compress (if they so choose) when using our <a  href="http://www.permabit.com/albireo/deployment.asp">Albireo SDK</a>.</p>
<p>I&#8217;ll comment on his technical concerns. <span id="more-867"></span></p>
<p>Compression and deduplication are very similar in that they identify and eliminate redundant data, but the scope of this duplicate identification is vastly different. Traditional compression works on a small window of data and with short duplicate segments so that the compression tables fit efficiently in a very small amount of memory. Storwize may not be using a 64 KB window, but I imagine the order of magnitude is about right&#8230; and that&#8217;s not a criticism of their technology at all. In fact, the way Storwize manages data in chunks so that they can maintain performance is very clever.</p>
<p>Calling deduplication lossy is nonsense; both compression and dedupe replace redundant data with references to other instances of that data, just at different scales as I note above. Unlike Ocarina&#8217;s NFO, which frighteningly throws away actual content, both dedupe and traditional compression return the original bitstream.  Tom&#8217;s point was that Albireo embedded dedupe leverages existing file and block system concepts to make those references so no interaction with our software is required on read, while a compression appliance modifies the data format before it reaches the storage array, which creates data lock-in. Take away the appliance, and the storage is full of uninterpretable data. That&#8217;s a concern for storage vendors and users alike.</p>
<p>As to the chart, when you look at this as &#8216;embedded dedupe&#8217; vs. &#8216;appliance-mediated compression&#8217;, you can see why Tom says that appliance compression alters the data, and Albireo dedupe does not require &#8216;rehydration&#8217;.  As for &#8216;optimizes block&#8217;, I haven&#8217;t yet seen Storwize&#8217;s block optimization products, so I can&#8217;t comment, but I do wonder how they make the space saved by compression available to the user. We agree that savings are absolutely data dependent. In general, deduplication alone offers more savings than compression alone, and both together give the best results by far. Perhaps we can work together to ensure Albireo and Storwize yield optimal results?</p>
]]></content:encoded>
			<wfw:commentRss>http://permabit.com/media-center/blogs/2010/06/compression-and-dedupe-redux/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Albireo &#8211; Storage Optimization Realized</title>
		<link>http://permabit.com/media-center/blogs/2010/06/albireo-storage-optimization-realized/</link>
		<comments>http://permabit.com/media-center/blogs/2010/06/albireo-storage-optimization-realized/#comments</comments>
		<pubDate>Wed, 09 Jun 2010 14:45:41 +0000</pubDate>
		<dc:creator>Jered Floyd</dc:creator>
				<category><![CDATA[Jered Floyd, CTO]]></category>
		<category><![CDATA[dedupe]]></category>
		<category><![CDATA[Dedupe2.0]]></category>
		<category><![CDATA[deduplication]]></category>
		<category><![CDATA[primary dedupe]]></category>
		<category><![CDATA[primary storage]]></category>
		<category><![CDATA[primary storage deduplication]]></category>

		<guid isPermaLink="false">http://blog.permabit.com/?p=824</guid>
		<description><![CDATA[In my last post, I gave the history of Albireo and I mentioned that we came to recognize seven key attributes that are absolute requirements for an integrated primary deduplication solution. First, Albireo supports block, file and also new unified or converged storage platforms.  By addressing all types of primary storage, we avoid leaving huge...]]></description>
				<content:encoded><![CDATA[<p>In <a href="../../../../../index.php/2010/06/a-star-is-born/">my last post</a>, I gave the history of Albireo and I mentioned that we came to recognize seven key attributes that are absolute requirements for an integrated primary deduplication solution.</p>
<p>First, Albireo supports <strong><a  href="http://www.permabit.com/albireo/architecture.asp">block, file and also new unified or converged storage platforms</a></strong>.  By addressing all types of primary storage, we avoid leaving huge amounts of users&#8217; data unoptimized.  Additionally, next generation storage platforms put block and file data on the same underlying storage, and Albireo makes it possible to identify and deduplicate data across both.<span id="more-824"></span></p>
<p>Next, we uniquely provide the ability <strong>to scale deduplication across a pool of storage many petabytes in size</strong>, instead of limiting deduplication to smaller islands of a few terabytes.  This is critical to delivering high rates of deduplication.</p>
<p>Further, Albireo delivers <strong>sub-file, content-aware deduplication</strong>.  Whole-file single instancing just doesn&#8217;t cut it for common primary data files, like Office documents or virtual system images. Albireo can identify optimal boundaries in a variety of file types and then deduplicate segments as small as the storage can support.  This delivers industry-leading deduplication efficiency to our partners.</p>
<p>As I explained in my last post, Albireo is successful because it is an <strong>embedded, integrated solution</strong>.  Integrating directly with primary storage vendors&#8217; technology, it <strong>leverages all their existing R&amp;D</strong>.  We also provide the capability for Albireo to be <strong><a  href="http://www.permabit.com/albireo/deployment.asp">integrated as inline, post-process, or parallel deduplication</a></strong>, whichever best matches the underlying storage platform.  This means that there are no rough edges where features of the underlying storage are lost; the deduplication is transparent and automatic.  Users may not even know that their storage is using Albireo for deduplication, except for the levels of savings and performance far beyond what anyone has seen before.</p>
<p>Finally, because Albireo is delivered as a software tool kit, it is <strong>integrated outside of the storage read path</strong>.  Our technology solves the hardest parts of deduplication, namely sub-file duplicate identification, and then leverages existing vendor file and block system metadata for eliminating duplicates.  Because of this, read operations only look at the file or block metadata without the need to consult our indexes, meaning we have no impact on performance, functionality, or data integrity.  Even if our software is turned off, all user data remains accessible.  This is completely unique in the industry.</p>
<p>I&#8217;m extremely excited to be able to now talk publicly about Albireo and the deduplication benefits it provides to existing primary storage technologies.  We&#8217;ve been focused on this for the past year, and the success of our partners&#8217; integration efforts has confirmed the ease of integration and technological power of the Albireo toolkit.  Through our partners, users will be using this soon, and I&#8217;m sure that they&#8217;ll be pleased with the results.</p>
]]></content:encoded>
			<wfw:commentRss>http://permabit.com/media-center/blogs/2010/06/albireo-storage-optimization-realized/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Star is Born</title>
		<link>http://permabit.com/media-center/blogs/2010/06/a-star-is-born/</link>
		<comments>http://permabit.com/media-center/blogs/2010/06/a-star-is-born/#comments</comments>
		<pubDate>Tue, 08 Jun 2010 14:43:32 +0000</pubDate>
		<dc:creator>Jered Floyd</dc:creator>
				<category><![CDATA[Jered Floyd, CTO]]></category>
		<category><![CDATA[dedupe]]></category>
		<category><![CDATA[Dedupe2.0]]></category>
		<category><![CDATA[deduplication]]></category>
		<category><![CDATA[primary deduplication]]></category>
		<category><![CDATA[primary storage]]></category>
		<category><![CDATA[primary storage deduplication]]></category>

		<guid isPermaLink="false">http://blog.permabit.com/?p=819</guid>
		<description><![CDATA[In his post, Tom wrote about the top three things we heard from customers about deduplication. Given the wildfire success of deduplication for backup storage, everyone now wants deduplication to optimize primary storage, but nobody is willing to sacrifice performance, functionality, or safety.  This is absolutely sensible &#8211; deduplication should be a valuable, cost-saving feature and...]]></description>
				<content:encoded><![CDATA[<p>In his <a href="../../../../../index.php/2010/06/left-lane-driving-and-primary-storage-optimization/">post</a>, Tom wrote about the top three things we heard from customers about deduplication. Given the wildfire success of deduplication for backup storage, everyone now wants deduplication to optimize primary storage, but nobody is willing to sacrifice performance, functionality, or safety.  This is absolutely sensible &#8211; deduplication should be a valuable, cost-saving feature and not a tradeoff against other core functionality. Nobody has been able to deliver this &#8211; until <a  href="http://www.permabit.com/albireo/albireo-overview.asp">Albireo</a>.<span id="more-819"></span></p>
<p>Permabit has been in the deduplication business since 2000, more than ten years, and we&#8217;ve learned a great deal about both technology and customer requirements. In fact, we&#8217;ve explored delivering deduplication to the OEM storage vendor market for some time, after including it for many years in our <a  href="http://www.permabit.com/products/data-center-series.asp">Enterprise Archive</a> product. If you&#8217;re not familiar with it, our Enterprise Archive is a complete stack solution for efficient value-tier storage, delivering our own file interfaces, file system, <a  href="http://www.permabit.com/products/rain-ec.asp">RAIN-EC</a> data protection, and hardware.  When talking with tier 1 storage vendors we were told many times, &#8220;we&#8217;ve invested millions in our file systems and data protection; yours is great, but we really just want deduplication.  Can you give us just that?&#8221;</p>
<p>For a long time I, along with the rest of the industry, thought the answer was &#8220;no&#8221;. When our engineers explored just providing dedupe we ended up with complex &#8220;bump-in-the-wire&#8221; appliances that sat in front of the storage and treated it almost like JBOD, masking performance and functionality, and jeopardizing integrity through data lock-in to the solution. We didn&#8217;t find this acceptable, and refused to try to sell it.  Others weren&#8217;t as resolute; they have tried to bring solutions like this to market and have found it challenging and less than rewarding.</p>
<p>Then, a bit over a year ago, I had an idea. What if we could provide a development kit that delivered the core technologies in deduplication and integrated into the vendor&#8217;s existing storage stack, rather than sitting outside it? We could avoid competing with our partners on functionality, and eliminate the concerns that Tom explained so clearly. That&#8217;s what <a  href="http://www.permabit.com/albireo/albireo-overview.asp">Albireo</a> is &#8211; the core technologies that make fast, scalable deduplication possible, packaged in a way that they can be integrated into any storage vendor&#8217;s stack in a matter of days to weeks. Our engineering team took this idea, extracted several years of deduplication effort from our core research, enhanced it further and packaged it as a complete SDK, all in under a year.</p>
<p>I named this project Albireo, after the most visible <a  href="http://en.wikipedia.org/wiki/Albireo">binary star system</a>. I thought this captured the basic idea of the product: a technology that works alongside an existing storage stack to deliver a powerful deduplicating storage solution that looks and acts as a single product. And of course, it also makes multiple data instances appear as one.</p>
<p>In the process of developing Albireo we learned seven key requirements for an integrated deduplication solution. Be sure to read my next post to find out how these have all been critical to the success of Albireo with our partners.</p>
]]></content:encoded>
			<wfw:commentRss>http://permabit.com/media-center/blogs/2010/06/a-star-is-born/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Albireo Seven Requirements for Data Optimization</title>
		<link>http://permabit.com/video/albireo-seven-requirements-for-data-optimization/</link>
		<comments>http://permabit.com/video/albireo-seven-requirements-for-data-optimization/#comments</comments>
		<pubDate>Sun, 06 Jun 2010 16:47:17 +0000</pubDate>
		<dc:creator>perma-admin</dc:creator>
				<category><![CDATA[Jered Floyd, CTO]]></category>
		<category><![CDATA[data optimization]]></category>
		<category><![CDATA[data reduction]]></category>
		<category><![CDATA[dedupe]]></category>
		<category><![CDATA[deduplication]]></category>
		<category><![CDATA[enterprise storage]]></category>
		<category><![CDATA[Permabit]]></category>
		<category><![CDATA[primary storage deduplication]]></category>

		<guid isPermaLink="false">http://antfarmdesign.com/?post_type=video&#038;p=1929</guid>
		<description><![CDATA[by Jered Floyd, CTO &#38; Founder]]></description>
				<content:encoded><![CDATA[<p>by Jered Floyd, CTO &amp; Founder</p>
]]></content:encoded>
			<wfw:commentRss>http://permabit.com/video/albireo-seven-requirements-for-data-optimization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Primary Storage Deduplication is the Future</title>
		<link>http://permabit.com/media-center/blogs/2009/10/primary-storage-deduplication-is-the-future/</link>
		<comments>http://permabit.com/media-center/blogs/2009/10/primary-storage-deduplication-is-the-future/#comments</comments>
		<pubDate>Tue, 06 Oct 2009 21:20:41 +0000</pubDate>
		<dc:creator>Jered Floyd</dc:creator>
				<category><![CDATA[Jered Floyd, CTO]]></category>
		<category><![CDATA[CTO]]></category>
		<category><![CDATA[dedupe]]></category>
		<category><![CDATA[deduplication]]></category>
		<category><![CDATA[primary storage]]></category>

		<guid isPermaLink="false">http://blog.permabit.com/?p=514</guid>
		<description><![CDATA[Until now, I&#8217;ve chosen to stay out of the little tempest in a teapot that&#8217;s going on over at Chuck Hollis&#8217; blog, but it doesn&#8217;t seem to be quieting down. He basically says that Data dedupe has no place on primary storage, which flies in the face of where the dedupe market is going&#8230; but...]]></description>
				<content:encoded><![CDATA[<p>Until now, I&#8217;ve chosen to stay out of the little <a  href="http://chucksblog.emc.com/chucks_blog/2009/09/a-quick-note-on-primary-data-dedupe-and-io-density.html">tempest in a teapot</a> that&#8217;s going on over at Chuck Hollis&#8217; blog, but it doesn&#8217;t seem to be quieting down.  He basically says that <a  title="Data Dedupe" href="http://www.permabit.com/products/sdr.asp">Data dedupe</a> has no place on primary storage, which flies in the face of where the dedupe market is going&#8230; but it&#8217;s not a bad position to take when your company makes a lot of money off of very expensive primary storage. <span id="more-514"></span></p>
<p>Their biggest NAS competitor took the bait, and NetApp jumped into the fray. In between the vitriol, some good points are made about why Chuck is wrong.  For example, if deduplication is increasing access to common blocks it means that you&#8217;ll be seeing much better cache efficiency, which will offset additional load on the drives.  The &#8220;boot storms&#8221; he talks about with many virtual machines hosted on the same storage are actually less likely to occur with deduplication than without!</p>
<p>Now <a  href="http://blogs.hds.com/hu/2009/10/i-agree-with-chuck-on-data-dedupe.html">Hu Yoshida has weighed in on Chuck&#8217;s side</a>, but then laid out the much more reasonable view that virtualization, tiering and dynamic provisioning are critical in an environment with costly top-tier primary storage.  That&#8217;s correct, but it doesn&#8217;t mean that deduplication at that top tier isn&#8217;t a huge win as well.</p>
<p>Deduplication has seen its first successes in the D2D backup space, where it&#8217;s easy to get a lot of deduplication due to the data patterns and traditional backup schedule.  Applying deduplication beyond backup is hard, because the opportunities for deduplication are fewer and further between, and so these D2D backup devices have <a  href="http://permabit.com/?p=390">never been able to address archive or primary storage effectively</a>. That doesn&#8217;t mean dedupe is bad for primary, it just means that it&#8217;s harder to do.</p>
<p>At Permabit, we consider dedupe for backup to be Dedupe 1.0, and the future for innovation is in Dedupe 2.0, which includes dedupe for primary and cloud storage. We host a forum over at <a  href="http://www.dedupe2.com/">Dedupe 2.0</a> to discuss this further, and recently released our Permabit Cloud Storage product to address these new customer needs.  I can&#8217;t give too much detail, but we&#8217;re constantly at work making our deduplication technology available to ever broader markets.</p>
<p><a  title="Dedupe" href="http://www.permabit.com/products/sdr.asp">Dedupe</a> for primary is a huge win for the storage consumer, but it&#8217;s taken us nearly a decade of extensive technology and patent development to solve the scalability and speed challenges needed for that market.  I think it&#8217;s no coincidence that the two voices denouncing primary dedupe the most, HDS and EMC, have no products to offer that include a feature which will soon become a customer requirement.</p>
<p>If you&#8217;re going to be at <a  href="http://snwusa.com/">Storage Networking World</a> next week and would like to hear more on primary dedupe, <a  href="http://tanejagroup.com/">Arun Taneja</a> is moderating a panel, &#8220;Primary Storage: The New Frontier for <a  title="Data Deduplication" href="http://www.permabit.com/products/sdr.asp">Data Deduplication</a>&#8221;.  I&#8217;ll be there, along with Val Bercovici from NetApp, Carter George from Ocarina, and Peter Smails from Storwize.  It should be a lively discussion!  Perhaps Chuck will stop by?</p>
]]></content:encoded>
			<wfw:commentRss>http://permabit.com/media-center/blogs/2009/10/primary-storage-deduplication-is-the-future/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Deduplication and Encryption</title>
		<link>http://permabit.com/media-center/blogs/2009/08/deduplication-and-encryption/</link>
		<comments>http://permabit.com/media-center/blogs/2009/08/deduplication-and-encryption/#comments</comments>
		<pubDate>Fri, 28 Aug 2009 17:03:28 +0000</pubDate>
		<dc:creator>Jered Floyd</dc:creator>
				<category><![CDATA[Jered Floyd, CTO]]></category>
		<category><![CDATA[archiving]]></category>
		<category><![CDATA[dedupe]]></category>
		<category><![CDATA[deduplication]]></category>
		<category><![CDATA[Encryption]]></category>
		<category><![CDATA[enterprise data archive]]></category>
		<category><![CDATA[enterprise storage]]></category>

		<guid isPermaLink="false">http://blog.permabit.com/?p=499</guid>
		<description><![CDATA[I spend a lot of my time talking with Permabit customers and more and more recently I have heard questions on proper use of encryption in their storage environments. Lost customer data is a huge risk to businesses, and risk often directly translates to cost, either from legal penalties or in cleanup. For example, companies...]]></description>
				<content:encoded><![CDATA[<p>I spend a lot of my time talking with Permabit customers and more and more recently I have heard questions on proper use of encryption in their storage environments.  Lost customer data is a huge risk to businesses, and risk often directly translates to cost, either from legal penalties or in cleanup.  For example, companies which handle credit card data have been scrambling to comply with the <a  href="https://www.pcisecuritystandards.org/">PCI Data Security Standards</a>, and still in the news we hear about horrors like the <a  href="http://news.yahoo.com/s/ap/20090817/ap_on_re_us/us_hacker_charges">theft of 130 million credit card numbers</a>.  Encryption is all about obscuring data, but deduplication is about seeing through and eliminating duplicates within your data.  How can these coexist?<span id="more-499"></span></p>
<p>They can, but it depends a lot on when and how you encrypt your data.  The challenge is in balancing when during your information lifecycle your data is encrypted, and how it is handled.  If you encrypt data high up the stack, in the application, then it&#8217;s more likely to be end-to-end secure, but cannot be easily shared with other applications in your environment.  If you encrypt data lower down the stack, in the storage, it can be easily shared but you must be careful how it is protected in transit.  In many environments, different cryptographic implementations are appropriate for different kinds of data.</p>
<p>Broadly, there are two areas in which to consider encryption implementation &#8212; transport encryption and data encryption.  Transport encryption is where all communication between two applications or servers is encrypted, delivering a secure communications channel.  This means that both data and control commands are encrypted; this is important because it means that an eavesdropper cannot see your data and also cannot tell what sort of things you are doing with that data.  With transport encryption, the data on the other side of the connection is generally handled in unencrypted form, so this does not protect against data leaks or maliciousness within the application.  <a  href="http://en.wikipedia.org/wiki/Https">HTTPS</a> is a common example of use of transport security.</p>
<p>Data encryption, on the other hand, is where the individual pieces of information being processed are encrypted.  These may be processed by the application in an encrypted form, stored into a database, or written to disk.  An untrusted application, like perhaps a storage system, can be handed data that has been encrypted without concern that the information will be leaked.  An encrypted archive file on disk would be an example here.  In some ways you could consider tape encryption as either or both types of security.  The tape is carrying data and control information between two applications (or two runtime instances of the same application), so it could be considered a form of &#8220;transport&#8221; security, but if you consider the third-party handling your tapes as part of your storage infrastructure, it&#8217;s more like a form of data encryption.</p>
<p>When selecting a product that incorporates encryption, one thing to consider is the encryption algorithm used; luckily, this is an easy choice.  Only use products that implement AES, the <a  href="http://en.wikipedia.org/wiki/Advanced_Encryption_Standard">Advanced Encryption Standard</a>.  Anything else is unlikely to be (or remain) secure.  The only possible exception would be Triple DES, but it is a cipher that is showing its age.</p>
<p><strong>Recommendations for Deduplication</strong></p>
<p>If you would like to combine <a  href="http://www.permabit.com/products/sdr.asp" title="deduplication">deduplication</a> and encryption, at some level the storage stack must have access to the unencrypted data so that it can identify duplicates.  For a system with standard protocol interfaces, such as NFS and CIFS, this means not performing data encryption within your application on data that you want to deduplicate.  This doesn&#8217;t mean not to use encryption at all, however.</p>
<p>Data that must be kept very secure and is also unlikely to deduplicate, such as credit card numbers, can be safely encrypted by the application.  For protection of the remaining data, you must select a system that both delivers transport encryption between your application and the storage, and also includes data encryption internally to protect data on disk.  Surprisingly, there aren&#8217;t many options available today.</p>
<p>Several drive vendors have begun to incorporate full-disk encryption (FDE) into their disks, and storage array vendors are just beginning to make use of this.  This means that the data on such drives is protected against theft or loss of the drives, but the weakness is still that the device the drives are in must have the keys to unlock them.  That means if someone walks away with a server and its disks, all bets are off.  FDE drives are compatible with deduplication, though, because any deduplication activities are happening at a higher level in the storage system.</p>
<p><a  href="http://www.permabit.com/products/data-center-series.asp">Permabit Enterprise Archive</a> supports <a  href="http://www.permabit.com/products/privacy-access.asp">encryption</a> at a number of different layers.  This includes encryption to the client, encryption during replication, and encryption on disk.  Permabit always uses the AES cryptographic algorithm, as mentioned above.</p>
<p>First and foremost, transport encryption is used wherever possible. If the application protocol (e.g. NFS, CIFS) supports an encrypted connection, we will deliver that.  Unfortunately, this is not widely available today, with CIFS supporting secure authentication and some recent versions supporting secure transport.</p>
<p><a  href="http://www.permabit.com/products/replication.asp">Replication</a> is always performed over an encrypted channel as well, even if the data being transported is already encrypted.  This ensures that customer data is not replicated to an attacker outside your firewall that has surreptitiously tried to intercept your replication data stream.  Additionally, because the transport channel is encrypted, an eavesdropper cannot tell anything about the sort of data being replicated, such as file sizes.</p>
<p>Finally, Enterprise Archive can optionally encrypt data on disk so that it is protected against theft or loss of the hardware.  This option can be configured on a volume by volume basis. If someone were to walk away with one (or even all) of the hard drives in an Enterprise Archive install, they would not be able to make sense of any of the data.  This offers strong protection against data theft from the equipment.</p>
<p>Permabit&#8217;s on-disk encryption offers additional protection beyond what full-disk hardware encryption offers, because of how encryption keys are handled.  In a system with FDE disks, encryption keys must be stored on the server so as to unlock the disks when they are powered on.  While this means that a stolen disk is of no use, a stolen server necessarily contains the keys that will unlock its disks.  In Enterprise Archive, data encryption happens at the access node layer, significantly reducing the vulnerability profile of the keys.  Because application transports like NFS require cleartext data, the access nodes must have access to the data encryption keys.  For encrypted volumes, they encrypt and decrypt data as it flows to and from the storage nodes in the system.  This means that storage nodes never have the keys necessary to decrypt the data that they hold and protect, a significant separation of responsibility.</p>
<p>Overall, <a  href="http://www.permabit.com/products/sdr.asp">deduplication</a> and encryption are compatible, but to use them together you must take care on where you apply the encryption.  For data to be deduplicated, encryption must take place within the storage system, and not at an application or gateway layer.  To ensure data security, make sure information is always encrypted in flight whenever possible, especially during replication or backup.</p>
]]></content:encoded>
			<wfw:commentRss>http://permabit.com/media-center/blogs/2009/08/deduplication-and-encryption/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
