Content-Aware Segmentation
Albireo data optimization technology delivers deduplication at the sub-file level and supports both fixed-block and variable-block deduplication. Variable-block deduplication begins by segmenting data into chunks of variable length, with boundaries chosen by analyzing the content. This additional step can provide substantial savings over more traditional fixed-block deduplication schemes, because fixed-block deduplication cannot reclaim space when duplicate data is not aligned on block boundaries. For example, if container objects (such as ZIP archives) that share some files are broken into fixed-size blocks, the blocks will deduplicate only if the embedded files are stored with identical alignment within each container, which is unlikely. To handle deduplication of chunks with arbitrary alignment, Albireo supports content-aware deduplication APIs. These APIs perform variable-length segmentation based on the type of stream or file being processed.
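The alignment problem can be illustrated with a small, generic sketch. It is not Albireo's algorithm: the function names (`fixed_blocks`, `variable_chunks`, `shared`), the 64-byte block size, the 16-byte window, and the CRC-based boundary test are all assumptions chosen for illustration. The same payload is chunked twice, once as-is and once shifted by a 7-byte prefix, and the chunks common to both runs are counted.

```python
import random
import zlib


def fixed_blocks(data: bytes, size: int = 64) -> list:
    """Split data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list:
    """Cut a chunk wherever a checksum of the preceding `window` bytes has
    its low bits clear, so boundaries depend on content, not on offsets."""
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        if zlib.crc32(data[i - window:i]) & mask == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def shared(a: list, b: list) -> int:
    """Count distinct chunks that appear in both lists."""
    return len(set(a) & set(b))


if __name__ == "__main__":
    rng = random.Random(0)
    payload = bytes(rng.getrandbits(8) for _ in range(4096))
    shifted = b"HEADER!" + payload  # same content, moved 7 bytes off alignment
    print("fixed-block matches:",
          shared(fixed_blocks(payload), fixed_blocks(shifted)))
    print("variable-chunk matches:",
          shared(variable_chunks(payload), variable_chunks(shifted)))
```

Because each boundary decision looks only at the bytes immediately before it, the shifted copy resynchronizes after its first chunk, so nearly all variable-length chunks match while the fixed-size blocks do not. A production chunker would typically use a true rolling hash and enforce minimum and maximum chunk sizes, which this sketch omits.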
Albireo’s content-aware segmentation breaks large container objects into variable-sized chunks as new data is pushed to Albireo. Content-aware scanners analyze the data stream and place chunk boundaries where they give the best deduplication. This helps when duplicate data occurs in container objects with different block alignments (as when the same files appear within two ZIP archives). Because boundaries follow the content, the embedded files are located and deduplicated even if they appear at different offsets within the two container objects.
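As a rough sketch of what a format-aware scanner for ZIP might do, the example below uses Python's standard `zipfile` module to locate each embedded file's payload and fingerprint it. The helper names `build_zip` and `scan_zip` are hypothetical, members are stored uncompressed to keep the example short, and nothing here reflects Albireo's actual scanner implementation.

```python
import hashlib
import io
import struct
import zipfile


def build_zip(members: dict) -> bytes:
    """Create an in-memory ZIP archive with uncompressed (stored) members."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, content in members.items():
            zf.writestr(name, content)
    return buf.getvalue()


def scan_zip(raw: bytes) -> dict:
    """Return {member name: (payload offset, SHA-256 of payload bytes)}."""
    found = {}
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        for info in zf.infolist():
            # The local file header is 30 fixed bytes followed by the file
            # name and the extra field; their lengths sit at bytes 26..29.
            name_len, extra_len = struct.unpack_from(
                "<HH", raw, info.header_offset + 26)
            start = info.header_offset + 30 + name_len + extra_len
            payload = raw[start:start + info.compress_size]
            found[info.filename] = (start, hashlib.sha256(payload).hexdigest())
    return found


if __name__ == "__main__":
    report = b"quarterly report " * 200
    first = scan_zip(build_zip({"report.txt": report}))
    second = scan_zip(build_zip({"notes.txt": b"short note",
                                 "report.txt": report}))
    # Same fingerprint, different position inside each container.
    print(first["report.txt"])
    print(second["report.txt"])
```

The shared file sits at a different byte offset in each archive, yet its payload hashes identically once the scanner has found its boundaries. Compressed members would need extra handling (for example, fingerprinting the decompressed bytes), which is left out here.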
Albireo provides a “plug-in architecture” for content-aware, variable-length segmentation, along with several scanner modules for popular formats. These content “scanners” identify objects within specific compound data formats (e.g., Microsoft Office documents, ZIP, PDF, tar) and optimize their deduplication. The scanners analyze data in real time, identifying embedded objects quickly and delivering the data in a form suited to efficient optimization. The scanners bundled with Albireo cover the formats most often found in business data stores. An API is available for OEMs to create and implement their own application-specific scanners for further customization and savings.
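The general shape of a plug-in scanner design can be sketched as a registry that picks a scanner by a format's leading "magic" bytes and falls back to fixed-size blocks otherwise. Everything below is hypothetical and for illustration only: `ScannerRegistry`, `fixed_block_scanner`, `line_record_scanner`, and the toy `REC1` format are invented names, and the real Albireo scanner API is not described in this text.

```python
from typing import Callable, Dict, List

# Hypothetical scanner signature: take a byte stream, return its chunks.
Scanner = Callable[[bytes], List[bytes]]


class ScannerRegistry:
    """Dispatch a stream to the scanner registered for its magic prefix."""

    def __init__(self, fallback: Scanner) -> None:
        self._by_magic: Dict[bytes, Scanner] = {}
        self._fallback = fallback

    def register(self, magic: bytes, scanner: Scanner) -> None:
        self._by_magic[magic] = scanner

    def segment(self, data: bytes) -> List[bytes]:
        for magic, scanner in self._by_magic.items():
            if data.startswith(magic):
                return scanner(data)
        return self._fallback(data)  # unknown format: no content awareness


def fixed_block_scanner(data: bytes, size: int = 4096) -> List[bytes]:
    """Fallback: plain fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def line_record_scanner(data: bytes) -> List[bytes]:
    """Toy application-specific scanner: one chunk per newline-ended record."""
    return data.splitlines(keepends=True)


if __name__ == "__main__":
    registry = ScannerRegistry(fallback=fixed_block_scanner)
    registry.register(b"REC1\n", line_record_scanner)
    print(registry.segment(b"REC1\nalpha\nbeta\n"))
    print([len(c) for c in registry.segment(b"\x00" * 10000)])
```

Keeping format knowledge behind a single function signature means a new format can be supported by registering one more scanner, without touching the code that hashes and indexes the resulting chunks.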