What is Deduplication?

In information technology, deduplication (or data deduplication) is a process that identifies redundant data (duplicate detection) and eliminates it before it is written to non-volatile storage. Like other compression methods, it reduces the amount of data that has to be sent from a transmitter to a receiver. The efficiency of a deduplication algorithm is almost impossible to predict, because it depends on the structure of the data and its rate of change. Deduplication is nevertheless currently the most efficient way to reduce data in scenarios where a pattern repeats from backup cycle to backup cycle.

The main area of application for deduplication is backup, where in practice it can achieve compression rates of around 1:12 from cycle to cycle. Deduplication algorithms are especially useful in every application area where data is copied repeatedly. Deduplication is greatly advantageous for Hyper-V virtual machine backups and database server backups, because such systems require cyclic backups and have a block-oriented data structure.

How it Works

Deduplication systems operate differently from classic compression methods, which rely on only a few comparison patterns: they work on the so-called “block level”, i.e. files are divided into a number of blocks of equal size (usually powers of two). Herein also lies the distinction from Single Instance Storage (SIS), also known as content-addressed storage (CAS), which eliminates identical files.
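
To make this concrete, here is a minimal Python sketch of fixed-size block splitting; the 4 KiB block size is an assumption for illustration, not a value any particular product mandates.

    BLOCK_SIZE = 4096  # equal-size blocks, a power of two (assumed 4 KiB here)

    def split_fixed(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
        # Cut the byte stream into equal-size blocks; the last may be shorter.
        return [data[i:i + block_size] for i in range(0, len(data), block_size)]

    print([len(b) for b in split_fixed(b"A" * 10000)])  # [4096, 4096, 1808]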

An important function of deduplication is “fingerprinting”. Files are split into segments of varying size (chunks) and scanned at the byte level to find out which segments repeat most often, since replacing those repetitions with references to the original elements yields the greatest data reduction.

For example, when backing up data from disk to tape media, the ratio of new or modified data to unmodified data between two full backups is usually relatively low. Without deduplication, two full backups still need at least twice the storage space on tape. Deduplication detects identical parts in the data set and stores them only once: the unique segments are recorded in an index, and duplicate data blocks are replaced by references to them.

These references take up much less space than the byte sequences they point to. When a file is restored, each unique data block is read only once but may be written out multiple times. An index structure records which parts are unique and how the components fit together so that the original file can be recreated.
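
The following sketch, with invented names and a 4 KiB block size assumed purely for illustration, shows this mechanism end to end: unique blocks are stored once under their fingerprint, a file becomes a list of references, and a restore reads each unique block once while writing it out as often as it is referenced.

    import hashlib

    class DedupStore:
        # Hypothetical deduplicating store: unique blocks keyed by fingerprint.
        def __init__(self):
            self.blocks: dict[str, bytes] = {}

        def write(self, data: bytes, block_size: int = 4096) -> list[str]:
            refs = []
            for i in range(0, len(data), block_size):
                block = data[i:i + block_size]
                fp = hashlib.sha256(block).hexdigest()
                self.blocks.setdefault(fp, block)  # keep only the first copy
                refs.append(fp)                    # a pointer, not the bytes
            return refs

        def restore(self, refs: list[str]) -> bytes:
            # Each unique block is read once but written out per reference.
            return b"".join(self.blocks[fp] for fp in refs)

    store = DedupStore()
    refs = store.write(b"abc" * 8192)    # highly repetitive input
    print(len(refs), len(store.blocks))  # 6 references, only 3 unique blocks
    assert store.restore(refs) == b"abc" * 8192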

However, when deduplication is used, backups are no longer independent full backups. If an increment or a referenced data block is lost, data is lost with it, and the affected files can no longer be restored.

Methods

There are two ways to build the block index. The “reverse referencing” method stores the first occurrence of a common element, and all other identical blocks receive a reference to that first copy (sketched below). “Forward referencing” always stores the most recently encountered shared data block and repoints the previously encountered items to it. There is some controversy about which of the two methods restores data faster. Additional processing strategies, such as “in-band” and “out-band”, determine whether parsers process the data stream “on the fly” or only after it has been stored at the destination. In the first case only one data stream needs to exist; in the latter, the data can be examined in parallel using multiple streams.
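
As a rough illustration (the function name and layout format are invented here), reverse referencing can be sketched like this; forward referencing would instead keep the newest copy and repoint the older entries:

    def reverse_reference(blocks: list[bytes]):
        # Store the first occurrence; later duplicates point back to it.
        first_seen: dict[bytes, int] = {}
        layout = []  # entries are ("store", block) or ("ref", first_index)
        for i, block in enumerate(blocks):
            if block in first_seen:
                layout.append(("ref", first_seen[block]))
            else:
                first_seen[block] = i
                layout.append(("store", block))
        return layout

    print(reverse_reference([b"x", b"y", b"x", b"x"]))
    # [('store', b'x'), ('store', b'y'), ('ref', 0), ('ref', 0)]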


Chunking (fingerprinting)

Fingerprinting attempts to determine how the incoming data stream can best be broken into pieces so that as many identical blocks of data as possible are produced. This process is called chunking.
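
Below is a minimal sketch of content-defined chunking using a toy polynomial rolling hash; the window size, base, and boundary mask are arbitrary choices for illustration. Production systems typically use Rabin fingerprints or Buzhash and enforce minimum and maximum chunk sizes, which are omitted here.

    import collections, os

    WINDOW = 16            # bytes in the sliding window
    PRIME = 257            # polynomial base of the rolling hash
    MOD = 1 << 32
    MASK = (1 << 11) - 1   # boundary roughly every 2 KiB on random data
    TOP = pow(PRIME, WINDOW - 1, MOD)  # weight of the byte leaving the window

    def chunks(data: bytes):
        window = collections.deque(maxlen=WINDOW)
        h, start = 0, 0
        for i, byte in enumerate(data):
            if len(window) == WINDOW:          # slide: oldest byte drops out
                h = (h - window[0] * TOP) % MOD
            window.append(byte)
            h = (h * PRIME + byte) % MOD
            # declare a boundary where the window hash ends in 11 zero bits
            if len(window) == WINDOW and (h & MASK) == 0:
                yield data[start:i + 1]
                start = i + 1
                window.clear()
                h = 0
        if start < len(data):
            yield data[start:]                 # trailing chunk

    sizes = [len(c) for c in chunks(os.urandom(64 * 1024))]
    print(len(sizes), sizes[:5])  # chunk sizes scatter around roughly 2 KiB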


Identification of Blocks

Depending on how changes are made to a file and how precisely they can be detected, more or less redundancy will remain in the backup. At the same time, the complexity of the block index grows when a more sophisticated detection algorithm is used. It is therefore crucial to select the block identification method that finds the most common blocks for the given kind of data, as the comparison below illustrates.
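
To illustrate why the choice matters, the sketch below (reusing the chunks generator from the chunking section, with an invented fixed_blocks helper) inserts a single byte near the start of a file: with fixed-size blocks every subsequent block shifts and looks new, while most content-defined chunks keep their fingerprints.

    import hashlib, os

    def fingerprints(blocks):
        return {hashlib.sha256(b).hexdigest() for b in blocks}

    def fixed_blocks(data: bytes, size: int = 4096):
        return (data[i:i + size] for i in range(0, len(data), size))

    original = os.urandom(64 * 1024)
    edited = original[:100] + b"!" + original[100:]  # one inserted byte

    fixed_reused = fingerprints(fixed_blocks(original)) & fingerprints(fixed_blocks(edited))
    cdc_reused = fingerprints(chunks(original)) & fingerprints(chunks(edited))

    print("fixed-size blocks reused:", len(fixed_reused))     # typically 0
    print("content-defined chunks reused:", len(cdc_reused))  # most of them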


Source

Wikipedia

BackupChain (Backup Software for Windows & Hyper-V offering Deduplication)

