Real-time or near-real-time replication works much like a RAID mirror: block writes are sent to other media or servers within very short intervals, such as seconds. It isn’t always a good idea.
Experiments with this technology have shown that it complicates disaster recovery, for the following reasons.
VSS (Volume Shadow Copy Service) is the core Windows component needed for live backups. It gives each application a way to become compatible with live backup.
A live-backup-capable application registers with VSS and receives a signal to prepare for a live backup. It then brings its data structures to a consistent state, so that it can resume work after a reboot or restart without losing much data beyond the unfinished transactions at hand.
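To make the register, prepare, and resume handshake concrete, here is a minimal Python simulation. It is not the VSS API: `LedgerApp`, `SnapshotCoordinator`, and every method name are hypothetical stand-ins, and a real writer does far more work than flushing a list.

```python
from dataclasses import dataclass, field


@dataclass
class LedgerApp:
    """Hypothetical application that buffers writes in memory (not a real VSS writer)."""
    on_disk: list = field(default_factory=list)
    pending: list = field(default_factory=list)
    frozen: bool = False

    def write(self, record):
        if self.frozen:
            raise RuntimeError("writes are held while the snapshot is taken")
        self.pending.append(record)

    def prepare_for_snapshot(self):
        # Flush buffered records so the on-disk state is consistent, then hold writes.
        self.on_disk.extend(self.pending)
        self.pending.clear()
        self.frozen = True

    def resume(self):
        self.frozen = False


class SnapshotCoordinator:
    """Hypothetical stand-in for the service that coordinates a live backup."""

    def __init__(self):
        self.writers = []

    def register(self, app):
        self.writers.append(app)

    def snapshot(self):
        # Signal every registered application to reach a consistent state first.
        for app in self.writers:
            app.prepare_for_snapshot()
        try:
            # Copy the now-consistent on-disk state of each application.
            return [list(app.on_disk) for app in self.writers]
        finally:
            for app in self.writers:
                app.resume()


app = LedgerApp()
coordinator = SnapshotCoordinator()
coordinator.register(app)
app.write("order-1")
app.write("order-2")
image = coordinator.snapshot()
```

The point of the sketch is the ordering: nothing is copied until every registered application has flushed its buffers, which is exactly the work that makes each backup request expensive.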
As you can imagine, this is quite an effort, since it affects all apps and services, including the operating system itself. That’s why requesting a backup places such stress on servers.
VSS is really the only way to back up properly on Windows. Sure, other operating systems have similar mechanisms under different names, but think about it: each app must prepare, or else you won’t have good live backups!
Many of the CDP (continuous data protection) solutions we have investigated don’t use VSS. VSS requires a break of at least 15 minutes between backups for cleanup to take place.
Even if no time were needed in between, you wouldn’t want to waste valuable CPU and HDD resources on repeated backup prep work.
VSS can’t and won’t offer true CDP, for the logical reasons above.
True CDP, on the other hand, is a risky business, too. Would you call a RAID mirror drive a backup? Applications aren’t aware that their transient states are being replicated. Restarting a server or service whose data structures are in an unknown state will likely leave it unrecoverable, with partial or full data loss, unless both the data structures and the application are specifically designed to cope with a catastrophic event at every possible point in time. The fact that almost no application satisfies this requirement was one of the major reasons VSS was created in the first place.
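A toy example shows what a replicated transient state looks like. The account names, the amounts, and the `transfer_with_replication` helper are all made up for illustration; the only point is that a copy taken between two related writes captures a state the application never meant to persist.

```python
def transfer_with_replication(accounts, amount, replicate):
    """Move money in two separate writes, replicating after each one."""
    accounts["checking"] -= amount
    replicate(dict(accounts))  # replica captured between the two writes
    accounts["savings"] += amount
    replicate(dict(accounts))  # replica captured after the transaction completes


accounts = {"checking": 100, "savings": 0}
replicas = []
transfer_with_replication(accounts, 40, replicas.append)

mid_transaction, completed = replicas
mid_total = sum(mid_transaction.values())  # money is "missing" in this copy
completed_total = sum(completed.values())  # totals add up again
```

If the primary fails right after the first write, the mid-transaction copy is all that is left, and nothing in it tells the application that a transfer was half done.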
True CDP is also risky because near-real-time replication can suffer packet loss and buffer or temp-space overruns. The software then needs to rescan the entire file or volume to make sure it didn’t miss anything. The user, however, isn’t aware that he or she is relying on the software to detect potential corruption. If the software has a bug, or misses packets for any other reason without noticing, corruption will go unnoticed until a scan is either scheduled or manually requested. Lots of RAID arrays suffer from this phenomenon, too.
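A rescan of this kind can be sketched as a block-by-block hash comparison between source and replica. This is only an illustration under our own assumptions: the 4-byte block size, the in-memory byte strings, and the function names are hypothetical, and real products may detect divergence quite differently.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size, chosen only to keep the example readable


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Return one SHA-256 digest per fixed-size block."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def find_divergent_blocks(source: bytes, replica: bytes,
                          block_size: int = BLOCK_SIZE) -> list:
    """Return the indexes of blocks that differ or exist on one side only."""
    src = block_digests(source, block_size)
    dst = block_digests(replica, block_size)
    longest = max(len(src), len(dst))
    return [i for i in range(longest)
            if i >= len(src) or i >= len(dst) or src[i] != dst[i]]


source = b"AAAABBBBCCCCDDDD"
replica = b"AAAABBBBxxxxDDDD"  # the update to block 2 never arrived
divergent = find_divergent_blocks(source, replica)
```

Note that the comparison has to read every block on both sides, which is why such scans are expensive and tend to run on a schedule rather than continuously.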
Now some people may think: “What the hell? If you write ABC on paper, why would there be corruption later?” Because we aren’t dealing with paper. Mechanical hard drives and SSDs suffer from bit rot, and communication links aren’t perfect either. Bits may flip randomly, and the more bits you have (the bigger the hard drive), the more bits will flip. Bits on the hard drive won’t scream back to the app, “Hey look, the block just changed.” If we are lucky, there may be a checksum error when reading, but the checksums used in file systems and internal HDD structures are very primitive for performance reasons and miss plenty of “lost” bits here and there. In addition, by the time the app reads the block and discovers that the checksum is wrong and that some bits did indeed flip, it’s already too late! The original block of data is gone for good…
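Here is a small sketch of how a primitive checksum can miss flipped bits. The additive checksum below is a deliberately simple example, not the checksum of any particular file system or drive, and the sample bytes are made up. Two bit flips that cancel each other out slip past it, while a cryptographic hash notices.

```python
import hashlib


def additive_checksum(data: bytes) -> int:
    """Deliberately weak checksum: the sum of all bytes modulo 256."""
    return sum(data) % 256


def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of data with a single bit inverted."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)


original = bytes([0x01, 0x02, 0x10, 0x20])
# Two flips that cancel out: 0x01 -> 0x00 (minus 1) and 0x02 -> 0x03 (plus 1).
damaged = flip_bit(flip_bit(original, 0, 0), 1, 0)

weak_detects = additive_checksum(original) != additive_checksum(damaged)
strong_detects = (hashlib.sha256(original).digest()
                  != hashlib.sha256(damaged).digest())
```

Even the stronger hash only tells you that the block changed. It can’t give you the original bytes back, so you still need an intact copy somewhere else.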
Isn’t disaster recovery an interesting subject?