About Data Deduplication

Data de-duplication is a popular subject in the storage world. It is defined as the process of “eliminating redundant copies of data – copies that are created during complete system backups, e-mail attachments distributed to multiple users, shared documents, music and projects, etc” (Mandagere, et al., 2008). There have always been ways of reducing data redundancy, such as file compression with zip utilities, but de-duplication goes further: it reduces redundancy both within a file (intra-file) and across files (inter-file).

De-duplication does not only affect storage space; it also benefits network bandwidth and throughput on the SAN, because there is less data to transfer. Another advantage is more efficient backups: de-duplicating data in disk-to-tape and disk-to-disk backups shortens backup times and reduces the capacity the backups require. Basically, de-duplication speeds up data transfer and can reduce CPU overhead by as much as 30%. The results depend on the technique used, however. Research by Mandagere suggests that “between different de-duplication techniques the space savings varies by about 30%, the CPU usage [on the client host] differs by almost 6 times [after initial de-duplication] and the time to reconstruct a de-duplicated file can vary by more than 15 times” (Mandagere, et al., 2008).

Within D2D (disk-to-disk) storage, “De-duplication reduces the amount of redundant data that is backed up, which results in less capacity required to store that data” (Asaro & Biggar, 2007). Centralized de-duplication software can, for example, reduce the storage consumed by shared documents. Suppose a user embeds an image in a document, and that document is shared via email with numerous people. If that embedded image is also put into other documents, the storage consumed has grown many times over just from that one image. De-duplication software could reclaim that space through smart management: it stores just one copy of the image and supplies it to each document on demand.
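To make the "store just one copy" idea concrete, here is a minimal sketch in Python. It is not how any particular product works; the class name DedupStore, its methods, the 4096-byte chunk size, and the choice of SHA-256 are all assumptions made for illustration. Data is split into fixed-size chunks, each chunk is identified by its hash, and a chunk that has already been seen is not stored again, whether the repeat comes from the same file (intra-file) or from a different one (inter-file).

```python
import hashlib


class DedupStore:
    """Toy in-memory store that keeps each unique chunk only once.

    Hypothetical illustration: the names and the fixed chunk size are
    assumptions, not taken from any real de-duplication product.
    """

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # SHA-256 hex digest -> chunk bytes (stored once)
        self.files = {}   # file name -> ordered list of chunk digests

    def put(self, name, data):
        """Split data into fixed-size chunks and store only the new ones."""
        digests = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # setdefault leaves an existing chunk untouched, so repeats
            # cost only a digest reference, not another copy of the data.
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        """Rebuild a file on demand from its list of chunk digests."""
        return b"".join(self.chunks[d] for d in self.files[name])

    def logical_bytes(self):
        """Total size of all files as the users see them."""
        return sum(len(self.chunks[d]) for ds in self.files.values() for d in ds)

    def stored_bytes(self):
        """Bytes actually held after de-duplication."""
        return sum(len(c) for c in self.chunks.values())


if __name__ == "__main__":
    store = DedupStore()
    # Stand-in for an embedded image: four distinct 4096-byte chunks.
    image = b"".join(bytes([i]) * 4096 for i in range(4))
    for recipient in ("alice", "bob", "carol"):
        store.put(recipient + "/report-image", image)
    print("logical:", store.logical_bytes(), "stored:", store.stored_bytes())
```

In this sketch three recipients hold the same image, but the store keeps its chunks only once and rebuilds each copy when get is called. Fixed-size chunks only match when the repeated data lines up on the same chunk boundaries, so real systems differ in how they choose those boundaries; that choice is one of the reasons the savings and reconstruction costs quoted above vary between techniques.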

I have posted a good document on this subject by ESG on my server at ftp://tek1systems.com/deduplication/


Mandagere, N., Zhou, P., Smith, M., & Uttamchandani, S. (2008). Demystifying Data De-duplication. IBM Almaden Research Center.

Asaro, T., & Biggar, H. (2007). Data De-duplication and Disk-to-Disk Backup Systems [PDF]. Enterprise Strategy Group.