What is Deduplication?
In information technology, deduplication (or data deduplication) is a process that identifies redundant data (duplicate detection) and eliminates it before it is written to non-volatile storage. Like other compression methods, it reduces the amount of data that has to be sent from a transmitter to a receiver. The efficiency of a deduplication algorithm is almost impossible to predict in advance because it depends on the structure of the data and its rate of change. Where a pattern is observable from backup cycle to backup cycle, however, deduplication is currently the most efficient way to reduce data.
The main area of application for deduplication is backup, where in practice it can achieve compression rates of 1:12 from cycle to cycle. Deduplication algorithms are useful in essentially every application area where data is copied repeatedly. Deduplication is particularly advantageous for Hyper-V virtual machine backups and database server backups, because these systems need cyclic backups and use block-oriented data structures.
How it Works
Deduplication systems operate differently from classic compression methods: they use only a few pattern-matching methods and work at the so-called “block level”, i.e. files are divided into a number of blocks of equal size (usually a power of two). This is also what distinguishes deduplication from Single Instance Storage (SIS), which eliminates identical files (also known as content-addressed storage, CAS).
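A minimal sketch of block-level duplicate detection, assuming a 4096-byte block size and SHA-256 as the block fingerprint. Both choices, and the function names, are illustrative and not taken from any particular product.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size (a power of two)


def split_fixed(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into equal-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def unique_block_count(data: bytes, block_size: int = BLOCK_SIZE) -> tuple[int, int]:
    """Return (total blocks, distinct blocks) for the given data."""
    blocks = split_fixed(data, block_size)
    digests = {hashlib.sha256(block).digest() for block in blocks}
    return len(blocks), len(digests)


# Four blocks, but only two distinct contents ("A" blocks and one "B" block).
sample = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
total, unique = unique_block_count(sample)
print(total, unique)  # 4 2
```

Only the distinct blocks would have to be stored; the repeated ones can be replaced by references.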
An important function of deduplication is “fingerprinting”. Files are split into segments of varying size (chunks). They are scanned at the byte level to find the segments with the highest rate of repetition, which in turn yield the maximum data reduction when repeats are replaced by references to the original elements.
For example, when backing up data from disk to tape there is usually only a relatively low ratio of new or modified data to unmodified data between two full backups. Without deduplication, two full backups still need at least twice the storage space on tape. Deduplication detects the identical parts of the data set and skips them. The unique segments are recorded in a list, and repeated data blocks are stored only as references.
These references (pointers) take up much less space than the byte sequences they refer to. When a file is restored, each data block is read only once but may be written out multiple times. An index structure records which parts are unique and how the components are connected, so that the original file can be recreated.
However, when deduplication is used, backups are no longer independent full backups. If an increment is lost, the data it contained is lost as well, and any file that depends on it can no longer be restored.
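The mechanism described in this section can be sketched as a small in-memory store. The class name `DedupStore`, the tiny block size, and the use of SHA-256 digests as references are assumptions made for illustration; a real system would persist the blocks and the index on disk.

```python
import hashlib


class DedupStore:
    """Toy block store: each unique block is kept once, backups hold references."""

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.blocks = {}   # digest -> block bytes (unique blocks only)
        self.recipes = {}  # backup name -> ordered list of digests (the index)

    def backup(self, name: str, data: bytes) -> None:
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only if not seen before
            recipe.append(digest)                  # always record the reference
        self.recipes[name] = recipe

    def restore(self, name: str) -> bytes:
        # Follow the references in order to rebuild the original byte stream.
        return b"".join(self.blocks[d] for d in self.recipes[name])

    def stored_bytes(self) -> int:
        return sum(len(block) for block in self.blocks.values())


store = DedupStore(block_size=4)
store.backup("monday", b"AAAABBBBCCCC")
store.backup("tuesday", b"AAAABBBBDDDD")  # only the last block changed
print(store.stored_bytes())               # 16 bytes stored instead of 24
print(store.restore("tuesday"))           # b'AAAABBBBDDDD'

# Losing one shared block breaks every backup that references it.
shared = store.recipes["monday"][0]
del store.blocks[shared]
try:
    store.restore("tuesday")
except KeyError:
    print("restore failed: a referenced block is missing")
```

The last step shows why deduplicated backups are not independent: both backups reference the same stored block, so its loss affects both.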
Methods
There are two ways to create a file index. The “reverse referencing” method stores the first occurrence of a common element, and all other identical blocks get a reference to it. “Forward referencing” always stores the most recent shared data block and references the previously encountered items. There is some controversy about which of the two methods allows data to be restored more quickly.
Additional processing strategies, such as “in-band” and “out-of-band”, differ in whether parsers process the data stream “on the fly” or only after it has been stored at the destination. In the first case, only one data stream needs to exist. In the latter case, the file may be examined in parallel using multiple data streams.
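The difference between the two referencing methods can be illustrated with a short sketch. It follows one reading of the description above: reverse referencing keeps the first occurrence as data, forward referencing keeps the last. The function names and the `("data", ...)` / `("ref", ...)` entry format are hypothetical, and the forward variant is computed in one batch for simplicity, whereas a real system would update references incrementally.

```python
def reverse_referencing(blocks):
    """Keep the first occurrence as data; later duplicates point back to it."""
    first_seen = {}
    entries = []
    for i, block in enumerate(blocks):
        if block in first_seen:
            entries.append(("ref", first_seen[block]))
        else:
            first_seen[block] = i
            entries.append(("data", block))
    return entries


def forward_referencing(blocks):
    """Keep the most recent occurrence as data; earlier ones point forward to it."""
    last_seen = {block: i for i, block in enumerate(blocks)}
    return [
        ("data", block) if last_seen[block] == i else ("ref", last_seen[block])
        for i, block in enumerate(blocks)
    ]


def resolve(entries):
    """Rebuild the original block sequence by following the references."""
    return [e[1] if e[0] == "data" else entries[e[1]][1] for e in entries]


blocks = [b"A", b"B", b"A", b"C", b"A"]
print(reverse_referencing(blocks))
# [('data', b'A'), ('data', b'B'), ('ref', 0), ('data', b'C'), ('ref', 0)]
print(forward_referencing(blocks))
# [('ref', 4), ('data', b'B'), ('ref', 4), ('data', b'C'), ('data', b'A')]
```

Both layouts store the same number of unique blocks; they differ only in which occurrence holds the data, which is what the restore-speed debate is about.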
Chunking (fingerprinting)
Fingerprinting attempts to determine how the incoming data stream can be split into pieces so as to produce as many identical blocks of data as possible. This process is called chunking.
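One common way to chunk a stream so that identical content yields identical pieces is content-defined chunking with a rolling hash. The sketch below uses a simple Gear-style hash. This technique is swapped in for illustration and is not described in the text above; the table seed, the size limits, and the function name `cdc_chunks` are all assumptions.

```python
import hashlib
import random

# Fixed pseudo-random table mapping each byte value to a 32-bit number.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def cdc_chunks(data: bytes, bits: int = 10, min_size: int = 64, max_size: int = 4096):
    """Split data where the top `bits` bits of a rolling hash are all zero."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # After 32 steps a byte has shifted out, so h depends on recent bytes only.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        at_cut = length >= min_size and (h >> (32 - bits)) == 0
        if at_cut or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


# Insert one byte at the front and compare the resulting chunk fingerprints.
data = bytes(random.Random(1).getrandbits(8) for _ in range(32768))
shifted = b"X" + data
a = {hashlib.sha256(c).digest() for c in cdc_chunks(data)}
b = {hashlib.sha256(c).digest() for c in cdc_chunks(shifted)}
print(len(a & b), "of", len(a), "chunks are unchanged after a one-byte insertion")
```

Because the cut points depend on the content rather than on absolute offsets, an insertion typically disturbs only the chunks near the change, whereas fixed-size blocks would all shift.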
Identification of Blocks
The amount of redundancy left in the backup file depends on how changes to the file are made and how precisely they can be detected: the more precise the detection, the less redundancy remains. However, the complexity of the block index also increases when a more complex detection algorithm is used. It is therefore crucial to select the block identification method that finds common blocks best for the nature of the data.
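The trade-off between detection precision and index size can be made concrete with a small experiment. This is a sketch under simple assumptions: fixed-size blocks, SHA-256 fingerprints, random sample data, and a single changed byte between two backups. The function name `second_backup_cost` is hypothetical.

```python
import hashlib
import random


def second_backup_cost(old: bytes, new: bytes, block_size: int):
    """Return (new bytes that must be stored, total index entries) after backing up `new`."""
    known = {
        hashlib.sha256(old[i:i + block_size]).digest()
        for i in range(0, len(old), block_size)
    }
    new_bytes = 0
    for i in range(0, len(new), block_size):
        block = new[i:i + block_size]
        digest = hashlib.sha256(block).digest()
        if digest not in known:  # block not seen before: store it and index it
            known.add(digest)
            new_bytes += len(block)
    return new_bytes, len(known)


rng = random.Random(42)
old = bytes(rng.getrandbits(8) for _ in range(65536))
changed = bytearray(old)
changed[1000] ^= 0xFF  # flip one byte so exactly one block differs
new = bytes(changed)

for block_size in (512, 4096, 32768):
    print(block_size, second_backup_cost(old, new, block_size))
```

Smaller blocks localize the change more precisely, so less new data has to be stored, but the index needs more entries; larger blocks keep the index small at the cost of storing more data per change.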
Source
Wikipedia
BackupChain (Backup Software for Windows & Hyper-V offering Deduplication)