Just a few years ago,
disk-to-disk backup seemed almost too good to be true. Powered by
inexpensive ATA (and later SATA) disk drives, D2D, whether
implemented as virtual tape libraries or as a backup-to-disk option
in your favorite backup application, made backups faster,
eliminated mechanical failures in tape drives and libraries, and
made it easier to deal with the continuous chorus of calls to the
helpdesk for individual file restores.
Today, our disk-backup devices are filling up, and there’s
not enough space or power in the data center to add another
petabyte of backup space, so we’re keeping only two to three
days’ worth of backups on disk, when we’d like to keep
a month’s worth. Problem is, there’s too much duplicate
data in our backup sets. The good news is, vendors—smelling
money, of course—are promising that their new data
de-duplication products can provide 20-to-1, even 300-to-1
reductions in the amount of data we need to store. Can it be?
Let’s take a look.
De-duplication technology lets you store more backup data on a
given set of disks. This can extend the period you keep disk
backups and reduce your data center power and cooling costs. If you
de-dupe data before sending it across the WAN, you can save on
bandwidth, making online off-site backups practical at companies
that used to rely on tape. The main drawback to data de-duplication
is that it can slow down the backup process.
Point of Origin
Duplicate data makes its way into backups in two ways: in the
temporal realm, as your backup program backs up the same file from
the same directory multiple times, and in the geographical realm, as
the same files are backed up from multiple locations in your
network. Most networks have a
surprising amount of duplicate data, from the holiday party
invitation PDF 56 users saved to their home directories to the 3 GB
of Windows files on the system drive of every server.
One solution to file duplication in the temporal realm is
incremental backup. Although we’re big fans of this,
especially the incremental-forever approach used by Tivoli Storage
Manager and others, we don’t consider incremental backups to
be data de-duplication any more than we consider RAID disaster
recovery. Incremental backups fall in the realm of duplicate
avoidance.
The most basic form of data de-duplication is the file-level
single-instance store found in CAS (content-addressable storage)
devices, such as EMC’s Centera. As each file is stored on a
CAS system, the device generates a hash of the file’s
contents; should a file with the same hash already exist, rather
than saving another copy, the system just creates another pointer
to the copy it already has.
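The pointer trick is easy to picture in a few lines of Python. What follows is a minimal sketch of a file-level single-instance store, not a description of how Centera or any other CAS product is built. The class name, the in-memory dictionaries and the choice of SHA-256 are all our assumptions for illustration.

```python
import hashlib

class SingleInstanceStore:
    """Toy content-addressable store: one stored copy per distinct content."""

    def __init__(self):
        self.blobs = {}      # content hash -> file bytes (stored once)
        self.pointers = {}   # file name -> content hash

    def put(self, name, data):
        # SHA-256 is an assumption; any strong hash would do for the sketch.
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = data   # first copy: keep the content
        self.pointers[name] = digest    # every copy: record a pointer only
        return digest

    def get(self, name):
        return self.blobs[self.pointers[name]]

store = SingleInstanceStore()
invitation = b"%PDF holiday party invitation"
for user in range(56):
    store.put(f"/home/user{user}/party.pdf", invitation)
store.put("/home/user0/notes.txt", b"something else")
print(len(store.pointers), "files,", len(store.blobs), "stored copies")
# 57 files, 2 stored copies
```

Fifty-six copies of the party invitation cost one stored file plus 56 small pointers.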
Microsoft’s latest version of Windows Storage Server,
the OEM NAS (network-attached storage) version of Windows server,
uses a slightly different approach to eliminating duplicate files.
Rather than identify duplicates as they’re written, WSS runs
a background process, the SIS (single-instance storage) Groveler,
which identifies duplicate files using a partial file hash function
followed by a full binary comparison, moves the file to a common
storage area and replaces the files in their original locations
with links to the file in the common store.
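The two-step check (cheap partial hash first, full comparison second) can be sketched as follows. This is our own illustration, not Microsoft's code: the function names, the 4-KB sample size and the use of SHA-1 are assumptions, and the final step of moving the file and leaving links behind is left out.

```python
import filecmp
import hashlib
import os
import tempfile

def partial_signature(path, sample_bytes=4096):
    """Cheap first-pass signature: file size plus a hash of the first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(sample_bytes)
    return os.path.getsize(path), hashlib.sha1(head).hexdigest()

def find_duplicates(paths):
    """Map each duplicate path to the first path seen with identical content."""
    candidates = {}   # partial signature -> paths kept as originals
    duplicates = {}   # duplicate path -> original path
    for path in paths:
        bucket = candidates.setdefault(partial_signature(path), [])
        for original in bucket:
            # A matching signature is only a hint; confirm byte for byte.
            if filecmp.cmp(original, path, shallow=False):
                duplicates[path] = original
                break
        else:
            bucket.append(path)
    return duplicates

with tempfile.TemporaryDirectory() as folder:
    contents = {
        "a.doc": b"x" * 5000 + b"A",
        "b.doc": b"x" * 5000 + b"A",   # true duplicate of a.doc
        "c.doc": b"x" * 5000 + b"C",   # same size and first 4 KB, different tail
    }
    paths = []
    for name, data in contents.items():
        path = os.path.join(folder, name)
        with open(path, "wb") as handle:
            handle.write(data)
        paths.append(path)
    found = find_duplicates(paths)
    result = {os.path.basename(k): os.path.basename(v) for k, v in found.items()}

print(result)  # {'b.doc': 'a.doc'}
```

Note that c.doc passes the partial-hash test but fails the full comparison, which is exactly why the second step exists.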
Although file-level SIS can save some space, things get really
interesting if we eliminate not only duplicate files but also
duplicate data within files. Think of Outlook’s
lowly .PST file. A typical user may have a 300-MB or larger .PST
holding all his e-mail from time immemorial; every day he receives
one or more new messages, and since his .PST file is changed that
day, your backup program includes it in the incremental backup even
though there are only 25 KB of changes in the 300-MB file.
A de-duping product that could identify that 25 KB of new data and
store it without the rest of the baggage could save lots of disk
space. Extend that concept so that duplicate data, such as the
550-KB attachment that’s in 20 users’ .PST files, can
be eliminated, and you could achieve staggering data-reduction
factors. One group of such solutions is the data de-duping backup
targets pioneered by Data Domain. These devices look to a backup
application like a VTL (virtual tape library) or NAS device. They
take their data from the backup app and do their de-duplication
magic on it transparently.
Modus Operandi
Vendors have taken three basic approaches to the data
de-duplication process. The hash-based approach, used by Data
Domain, FalconStor Software in its VTL software and Quantum in its
new DXi-series appliances, breaks the data stream from the backup
app into blocks and generates a hash for each block, using SHA-1,
MD5 or a similar algorithm. If the hash for a new block matches a
hash that’s in the device’s hash index, the data has
already been backed up, and the device just updates its tables to
say the data exists in the new location too.
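To make the mechanics concrete, here is a minimal sketch of hash-based block de-duplication. It assumes fixed 4-KB blocks and an in-memory dictionary standing in for the hash index; the class and method names are ours, and no vendor's actual block size or index format is implied.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only

class BlockStore:
    def __init__(self):
        self.index = {}     # SHA-1 digest -> block data (the "hash index")
        self.recipes = {}   # backup name -> ordered list of digests

    def backup(self, name, stream):
        """Store a backup stream; return how many new bytes were written."""
        new_bytes = 0
        recipe = []
        for start in range(0, len(stream), BLOCK_SIZE):
            block = stream[start:start + BLOCK_SIZE]
            digest = hashlib.sha1(block).digest()
            if digest not in self.index:
                self.index[digest] = block   # unseen block: store it
                new_bytes += len(block)
            recipe.append(digest)            # seen or not: note where it goes
        self.recipes[name] = recipe
        return new_bytes

    def restore(self, name):
        return b"".join(self.index[d] for d in self.recipes[name])

# Monday's backup: 100 distinct blocks. Tuesday: one block changed in place.
monday = b"".join(bytes([i]) * BLOCK_SIZE for i in range(100))
tuesday = (monday[:42 * BLOCK_SIZE] + bytes([200]) * BLOCK_SIZE
           + monday[43 * BLOCK_SIZE:])

store = BlockStore()
stored_monday = store.backup("monday", monday)
stored_tuesday = store.backup("tuesday", tuesday)
print(stored_monday, stored_tuesday)  # 409600 4096
```

The second backup is the same size as the first, but only the one changed block lands on disk. Everything else is a table update.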
The hash-based approach has a built-in scalability issue. To
quickly tell if a given block of data has been backed up, the device
must hold the hash index in memory. As the number of backed-up blocks
grows, so does the index. Once the index grows beyond the
device’s ability to hold it in memory, performance falls off,
as disk searches are much slower than memory searches. As a result,
most hash-based systems are self-contained appliances balancing the
amount of memory with the amount of disk space for storing data so
the hash table never grows too big.
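A little arithmetic shows how fast the index grows. The figures below are our assumptions, not vendor specifications: 8-KB blocks and 32 bytes per index entry (a 20-byte SHA-1 hash plus 12 bytes to say where the block lives).

```python
def index_size_bytes(stored_bytes, block_size, entry_bytes):
    """Rough index size: one fixed-size entry per unique stored block."""
    blocks = -(-stored_bytes // block_size)  # ceiling division
    return blocks * entry_bytes

TB = 10**12
KB = 10**3

for capacity_tb in (1, 10, 100):
    size = index_size_bytes(capacity_tb * TB, 8 * KB, 32)
    print(f"{capacity_tb:>3} TB of unique data -> {size / 10**9:.0f} GB of index")
#   1 TB of unique data -> 4 GB of index
#  10 TB of unique data -> 40 GB of index
# 100 TB of unique data -> 400 GB of index
```

Under these assumptions, every terabyte of unique blocks costs about 4 GB of index, which is why the ratio of memory to disk matters so much in an appliance.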
The second approach, content-aware de-duplication, relies on the
backup appliance being aware of the data format it’s
recording. It can use the file-system metadata embedded in the
backup data to identify files; it then does byte-by-byte
comparisons with other versions in its data repository to create a
delta file of the changes in this version compared with the first
version stored. This approach avoids the possibility of a hash
collision (see “Don’t Fear Collisions,” below),
but requires the use of a supported backup app so the device can
extract metadata.
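The delta idea can be sketched with Python's standard difflib module. This is only an illustration of storing changes against an earlier version; it is not any vendor's delta format, and it skips the hard part, which is parsing the backup app's format to find the files in the first place. The "copy" and "data" operation names are ours.

```python
from difflib import SequenceMatcher

def make_delta(old, new):
    """Describe `new` as copies from `old` plus literal new bytes."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))       # reuse bytes already stored
        elif j2 > j1:                          # "replace" or "insert"
            ops.append(("data", new[j1:j2]))   # carry the new bytes
        # "delete" needs no entry: the old bytes are simply not copied
    return ops

def apply_delta(old, ops):
    """Rebuild the new version from the old version and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

# A stand-in for a mail file that gains one message overnight.
old_version = b"".join(b"message %03d\n" % i for i in range(100))
new_version = old_version + b"message 100\n"

delta = make_delta(old_version, new_version)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "data")
print(len(new_version), "bytes in the file,", literal_bytes, "bytes in the delta")
```

The delta carries only the bytes that are not already in the stored version, and apply_delta rebuilds the new file from the old one plus that delta.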
ExaGrid Systems’ InfiniteFiler is an example of a
content-aware de-duplication device that uses its knowledge of
common backup apps such as CommVault Galaxy and Symantec Backup Exec
to identify files from the source system as they’re backed
up. After the backup is completed, it identifies files that
have been backed up multiple times and generates deltas. Multiple
InfiniteFilers can be combined into a grid supporting up to 30 TB
of backup data. The de-duping approach ExaGrid uses does a good job
of storing the one new message in a 1-GB .PST file but it
can’t eliminate duplicate data across multiple different
files, like the same attachment in four .PSTs.
Sepaton’s DeltaStor for its VTLs also uses the content-aware
approach, but compares the new file both with previous versions
from the same location and with versions backed up from other
locations so it can eliminate geographical duplicates.
The third approach, used by Diligent Technologies in its
ProtecTier VTL, divides data into blocks like the hash-based
products but uses a proprietary algorithm to determine if given
blocks are similar to one another. It then does a byte-by-byte
compare of the data in similar blocks to determine if the block has
been backed up.
Hardware or Software
In addition to their de-duping approach, backup targets differ
in their physical architectures. Data Domain, ExaGrid and Quantum
make monolithic appliances that contain their disk arrays. The Data
Domain and Quantum appliances can have NAS or VTL interfaces, while
ExaGrid is always a NAS. Diligent and FalconStor sell their
products as software, running on an Intel or Opteron server, to
create a VTL gateway to external storage.
Although a backup appliance with a VTL interface may seem more
sophisticated and could be easier to integrate into an existing
tape-based backup environment, using a NAS interface gives your
backup application more control over virtual media management. When
a backup file reaches the end of its retention period, some backup
apps, including Symantec’s NetBackup, can delete the file
from their disk repository. When a de-duping NAS appliance sees the
deletion, it can reclaim the freed space and update its hash index. Since
you don’t delete tapes, there’s no way to release space
on a VTL until the virtual tape is overwritten.
Of course, there is a price to pay for fitting 25 TB of data in a
1-TB bag, and not just in dollars. All the work of slicing your
data into chunks and indexing it to remove the duplicates does slow
things down more than just a little. A midrange VTL like an
Overland REO 9000 can back up data at 300 MBps or better. Diligent
has been able to achieve 200-MBps backup rates on its ProtecTier in
third-party benchmarks, but that required a quad Opteron
server front-ending an array of more than 100 disk drives.
Other vendors address the problem by de-duping the data as a
separate process that runs after the backup. On a system running
FalconStor’s VTL software, data is written from the backup
app to a compressed but not de-duped virtual tape file. Then a
background process chunks the data, removes the duplicates and
creates a virtual virtual tape that is an index of which de-duped
data blocks were on the original virtual tape. Once the data from a
virtual tape is de-duped, the space it occupied is returned to the
available space pool. Sepaton’s DeltaStor and ExaGrid also
perform their de-duping as a post-backup process.
Although post-processing can boost backup speeds, it has its own
costs. A system that does post-process de-duping must have enough
disk space to hold a full set of standard backups in addition to
its de-duped data. If you’re looking to keep to a weekly
full/daily incremental backup schedule, you may need a couple of times
more disk space on a system that de-dupes in the background to hold
those full backups until it can digest them.
Just because the de-duping is running in the background,
don’t ignore de-duping performance. If your VTL hasn’t
finished digesting the weekend’s backups by the time you
start backing up your servers again on Monday night, you may not be
happy with the results. Disk space may not be available or the
de-duping process may slow down your backups.
Bandwidth Conservation
Saving disk space on a backup appliance isn’t the only
application of subfile de-duping technology. A new generation of
backup applications, including Asigra’s Televaulting,
EMC’s Avamar Axion and Symantec’s NetBackup PureDisk,
uses hash-based data de-duplication to reduce the bandwidth needed
to send backups across a WAN.
First, like any conventional backup application making an
incremental backup, these use the usual methods like archive bits,
last-modified dates and the file system change journal to ID the
files that have changed since the last backup. They then slice,
dice and julienne the file into smaller blocks and calculate hashes
for each block.
The hashes are then compared with a local cache of the hashes of
blocks that have already been backed up from the local site. The
hashes that don’t appear in the local cache are then sent, along
with file system metadata, to the central backup server, which
compares them with its hash tables. The backup server sends back a list of the
hashes that it hasn’t seen before; the server being backed up
then sends the data blocks represented by those hashes to the
central server for safekeeping.
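The exchange described above can be sketched as a toy client and server. The class names, the 4-KB block size and the byte counting are our assumptions; real products also send file system metadata and talk over a network, both of which this sketch leaves out.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only

def split_blocks(data):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

class CentralServer:
    def __init__(self):
        self.blocks = {}  # digest -> block data

    def unknown(self, digests):
        """Answer with the hashes this server has not seen before."""
        return [d for d in digests if d not in self.blocks]

    def store(self, blocks_by_digest):
        self.blocks.update(blocks_by_digest)

class BranchClient:
    def __init__(self, server):
        self.server = server
        self.local_cache = set()   # hashes already backed up from this site
        self.bytes_sent = 0        # what crossed the (imaginary) WAN

    def backup(self, data):
        by_digest = {hashlib.sha1(b).digest(): b for b in split_blocks(data)}
        # Step 1: drop blocks this site has already backed up.
        candidates = [d for d in by_digest if d not in self.local_cache]
        # Step 2: send hashes only; the server replies with the ones it lacks.
        self.bytes_sent += sum(len(d) for d in candidates)
        wanted = self.server.unknown(candidates)
        # Step 3: send the data blocks for those hashes only.
        payload = {d: by_digest[d] for d in wanted}
        self.bytes_sent += sum(len(b) for b in payload.values())
        self.server.store(payload)
        self.local_cache.update(by_digest)
        return len(wanted)

# The same 50-block presentation lands at two branch offices.
presentation = b"".join(bytes([i]) * BLOCK_SIZE for i in range(50))
server = CentralServer()
office_a, office_b = BranchClient(server), BranchClient(server)

blocks_from_a = office_a.backup(presentation)
blocks_from_b = office_b.backup(presentation)
print(blocks_from_a, office_a.bytes_sent)  # 50 205800
print(blocks_from_b, office_b.bytes_sent)  # 0 1000
```

The first office pays for the whole file plus 20 bytes of hash per block. The second office sends 1,000 bytes of hashes and nothing else.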
These backup solutions could reach even higher data-reduction
levels than the backup targets by de-duplicating not just the data
from the set of servers that are backed up to a single target or
even a cluster of targets but across the entire enterprise. If the
CEO sends a 100-MB PowerPoint presentation to all 500 branch
offices, it will be backed up from the one whose backup schedule
runs first. All the others will just send hashes to the home office
and be told, “We already got that, thanks.”
This approach is also less susceptible to the scalability
issues that affect hash-based systems. Since each remote
server only caches the hashes for its local data, that hash
table shouldn’t outgrow available space, and since the disk
I/O system at the central site is much faster than the WAN feeding
the backups, even searching a huge hash index on disk is much
faster than sending the data.
Although Televaulting, Avamar Axion and NetBackup PureDisk all
share a similar architecture and are priced based on the size of
the de-duplicated data store, there are some differences. NetBackup
PureDisk uses a fixed 128-KB block size, whereas Televaulting and
Avamar Axion use variable block sizes, which should result in
greater de-duplication. PureDisk can be managed from NetBackup, and
Symantec promises greater integration in the future, which we hope
means de-duplication integrated into data center backup jobs.
Asigra also markets Televaulting for service providers so small
businesses that don’t want to set up their own infrastructure
can take advantage of de-duplication too.
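Why should variable block sizes de-dupe better? Insert one byte at the front of a file and every fixed-size block after it shifts, so none of the old hashes match. If block boundaries are instead chosen from the content itself, they move along with the data. The sketch below shows the effect with a simple rolling hash. It is our illustration of the general idea, not the chunking algorithm of Televaulting, Avamar Axion or any other product, and the 1-KB sizes are chosen only to keep the example small.

```python
import hashlib
import random

rng = random.Random(0)
GEAR = [rng.getrandbits(32) for _ in range(256)]  # one random value per byte value
MASK = (1 << 10) - 1  # boundary when the low 10 bits are zero: ~1-KB average

def fixed_chunks(data, size=1024):
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data):
    """Cut wherever a rolling hash of the most recent bytes hits a pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def reused(old_chunks, new_chunks):
    """Count new chunks whose hash is already known from the old version."""
    known = {hashlib.sha1(c).digest() for c in old_chunks}
    return sum(hashlib.sha1(c).digest() in known for c in new_chunks)

original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
edited = b"!" + original   # one byte inserted at the very front

fixed_reuse = reused(fixed_chunks(original), fixed_chunks(edited))
variable_reuse = reused(variable_chunks(original), variable_chunks(edited))
total_variable = len(variable_chunks(edited))

print("fixed blocks reused:", fixed_reuse)
print("variable chunks reused:", variable_reuse, "of", total_variable)
```

With fixed blocks the one-byte insert leaves essentially nothing to reuse, while the content-defined chunks realign right after the edit and nearly all of them match.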
Backup targets, including FalconStor’s VTL, Quantum’s
DXi series and Data Domain’s appliances that can replicate
data after it has been de-duped, can see the same kind of bandwidth
reductions for branch data center off-site backups and disaster
recovery of applications that don’t require real-time
replication.
Data de-duplication is here to stay for at least a while. We spoke
to several users who report they really do get 20-to-1 and greater
data-reduction factors without making major changes to their backup
processes. Small organizations can use the new-generation backup
programs from Asigra, EMC and Symantec to replace their
conventional backup solutions. Midsize organizations can use backup
targets in the data center. Large enterprises with very high backup
performance needs may have to wait for the next generation.