In information technology, deduplication, or data deduplication, is a process that identifies redundant data (duplicate detection) and eliminates it before it is written to non-volatile storage. Like other compression methods, it reduces the amount of data sent from a transmitter to a receiver. It is almost impossible to predict the efficiency of a deduplication algorithm, because its efficiency depends on the structure of the data and its rate of change. Deduplication is, however, currently the most efficient way to reduce data where a pattern is observable from backup cycle to backup cycle.
The main area of application for deduplication is backup, where compression rates of about 1:12 from cycle to cycle are achievable in practice. Deduplication algorithms are especially useful in every application area where data is copied repeatedly. Deduplication is greatly advantageous for Hyper-V virtual machine backups and database server backups, because such systems require cyclic backups and store their data in a block-oriented structure.
Deduplication systems operate differently from classic compression methods, using only a few pattern-matching methods at the so-called “block level”, i.e. files are divided into a number of blocks of equal size (usually powers of two). Herein also lies the distinction from Single Instance Storage (SIS), which eliminates identical files (also known as content-addressed storage, CAS).
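As a minimal sketch of block-level duplicate detection (function and variable names here are illustrative, not taken from any particular product), a stream can be cut into equal-size blocks and compared by hash; two files that differ only in one block then still share the other blocks:

```python
import hashlib

def split_fixed_blocks(data: bytes, block_size: int = 4096) -> list:
    """Split a byte stream into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# Two buffers that share their first 4 KiB block but differ afterwards:
buf_a = b"A" * 4096 + b"B" * 4096
buf_b = b"A" * 4096 + b"C" * 4096

hashes_a = {hashlib.sha256(blk).hexdigest() for blk in split_fixed_blocks(buf_a)}
hashes_b = {hashlib.sha256(blk).hexdigest() for blk in split_fixed_blocks(buf_b)}

shared = hashes_a & hashes_b  # exactly one 4 KiB block is common to both
```

Note the contrast to SIS: a file-level scheme would treat `buf_a` and `buf_b` as entirely distinct, while block-level deduplication still saves the space of the shared block.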
An important function of deduplication is “fingerprinting”. Files are split into segments of varying size (chunks) and scanned at the byte level to determine which segments yield the highest rate of repetition, which in turn provides maximum data reduction when references to the original elements are used.
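Variable-size chunking is usually content-defined: a boundary is declared wherever a rolling checksum over a small byte window hits a chosen bit pattern, so boundaries depend on content rather than position. The sketch below uses a simple rolling sum for readability; production systems typically use Rabin fingerprints, and all parameters here are illustrative assumptions:

```python
def chunk_content_defined(data: bytes, window: int = 16, mask: int = 0xFF,
                          min_size: int = 128, max_size: int = 4096):
    """Yield variable-size chunks. A boundary is declared where the rolling
    sum over the last `window` bytes matches `mask` (average chunk size is
    roughly min_size + 256 bytes with this mask)."""
    start, rolling = 0, 0
    for i in range(len(data)):
        rolling += data[i]
        if i - start >= window:
            rolling -= data[i - window]  # drop the byte leaving the window
        size = i - start + 1
        if (size >= min_size and (rolling & mask) == mask) or size >= max_size:
            yield data[start:i + 1]
            start, rolling = i + 1, 0
    if start < len(data):
        yield data[start:]  # trailing partial chunk
```

Because boundaries are derived from local content, inserting a few bytes near the start of a file shifts only the nearby chunk boundaries; chunking resynchronizes afterwards and the later chunks repeat, which is exactly the repetition the fingerprinting step is looking for.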
For example, when backing up data from disk to tape media, there is usually only a relatively low ratio of new or modified data to unmodified data between two full backups. Without deduplication, two full backups still need at least twice the storage space on tape. Deduplication detects identical parts in the data set and skips them: the unique segments are recorded in a list, and repeated data blocks are replaced by references.
These pointers take up much less space than the byte sequences they reference. When the file is restored, each unique data block is read only once and written out multiple times. An index structure records which parts are unique and how the components are connected in order to recreate the original file.
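The store-and-index mechanism described above can be sketched as follows (a simplified model; names like `dedup_store` are assumptions for illustration, and real systems persist the store and index on disk rather than in a dictionary):

```python
import hashlib

def dedup_store(blocks):
    """Record each unique block once; the index keeps one hash per original
    block, so duplicates become cheap references into the store."""
    store, index = {}, []
    for blk in blocks:
        digest = hashlib.sha256(blk).hexdigest()
        store.setdefault(digest, blk)  # keep only the first copy of each block
        index.append(digest)
    return store, index

def dedup_restore(store, index):
    """Rebuild the original stream: each unique block is read once from the
    store and written out as many times as the index references it."""
    return b"".join(store[digest] for digest in index)

# Five blocks, three of them identical: only three are physically stored.
blocks = [b"aaaa", b"bbbb", b"aaaa", b"cccc", b"aaaa"]
store, index = dedup_store(blocks)
```

This also makes the dependency problem visible: every entry of `index` must resolve against `store`, so losing one stored block breaks the restore of every backup that references it.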
However, when deduplication is used, backups are no longer independent full backups. If an increment is lost, data is lost and the file can no longer be restored.
There are two ways to build the file index. The “reverse referencing” method stores the first common element; all other identical blocks receive a reference to that first one. “Forward referencing” always stores the most recently encountered shared data block and references the previously encountered items. There is some controversy about which of these two methods allows data to be restored more quickly. Additional processing strategies, such as “in-band” and “out-of-band”, differ in whether parsers process the data stream “on the fly” or only after it has been stored at the destination. In the first case, only one data stream needs to exist; in the latter, the file can be examined in parallel using multiple data streams.
Fingerprinting attempts to determine how the incoming data stream can be disassembled into pieces so as to produce as many identical blocks of data as possible. This process is called chunking.
Identification of Blocks
Depending on how changes to a file are made and how precisely they can be detected, more or less redundancy will remain in the backup file. However, the complexity of the block index also increases when a more elaborate detection algorithm is used. It is therefore crucial to select the block identification method that finds the most common blocks for the nature of the data at hand.
BackupChain (Backup Software for Windows & Hyper-V offering Deduplication)