How will data deduplication affect approaches such as
cloud computing and virtualization?
Backing up data without deduplicating it can prove very expensive
when the cloud is used for backup as a service.
With deduplication, once the data has been backed up, every
successive backup captures only the changes to that data.
A user has less data to back up and hence ends up paying less for
the cloud storage service than would be the case without
deduplication. For a cloud provider, this
means less investment in the cloud from a storage and network point
of view. Deduplication can help cloud providers reduce costs by
almost a factor of 50.
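The idea that successive backups store only what has changed can be
illustrated with a minimal sketch. It assumes fixed-size chunks
identified by their SHA-256 digest; the class name DedupStore, the
4-byte chunk size and the sample data are hypothetical and chosen
only to keep the example small. Production systems typically use much
larger, often variable-size, chunks.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; deliberately tiny for illustration (assumption)


class DedupStore:
    """Toy content-addressed store: each unique chunk is kept once."""

    def __init__(self):
        self.chunks = {}  # hex digest -> chunk bytes

    def backup(self, data: bytes):
        """Store data; return (recipe, number of bytes newly stored)."""
        recipe, new_bytes = [], 0
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                # Only chunks never seen before consume storage.
                self.chunks[digest] = chunk
                new_bytes += len(chunk)
            recipe.append(digest)
        return recipe, new_bytes

    def restore(self, recipe):
        """Rebuild the original data from its list of chunk digests."""
        return b"".join(self.chunks[d] for d in recipe)


store = DedupStore()
first = b"AAAABBBBCCCCDDDD"
second = b"AAAABBBBXXXXDDDD"  # one chunk changed since the first backup

recipe1, stored1 = store.backup(first)
recipe2, stored2 = store.backup(second)
print(stored1, stored2)  # 16 4
```

In this sketch the second backup adds only the one changed chunk to
the store, yet both versions remain fully restorable from their
recipes.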
Virtualized environments are now growing in a fashion similar to
data growth. Let us assume there is a physical server running 10
virtual machines (VMs) with the same enterprise e-mail application
on each one of them. This creates duplicate data from each of the
10 instances of the application, within that one physical server.
However, the bandwidth available to that physical machine does
not grow in proportion to the number of VMs. This can create an
issue when backing up the virtual environment data or when
replicating to a secondary site for disaster recovery.
With deduplication, the data leaving a virtual environment can be
reduced to roughly the amount that would leave a single
non-virtualized physical machine.
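A rough sketch of the 10-VM scenario follows. It assumes each VM
image consists of identical application data plus a small VM-specific
part; the byte strings, the 8-byte chunk size and the function name
unique_bytes are hypothetical, made up purely for illustration.

```python
import hashlib


def unique_bytes(images, chunk_size=8):
    """Count the bytes that remain after chunk-level deduplication."""
    seen = set()
    total = 0
    for image in images:
        for i in range(0, len(image), chunk_size):
            chunk = image[i:i + chunk_size]
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                total += len(chunk)
    return total


# Hypothetical data: 32 bytes of application data shared by every VM,
# followed by 8 bytes that differ per VM.
app = b"MAILBIN0MAILBIN1MAILBIN2MAILBIN3"
vm_images = [app + f"vm-{n:02d}---".encode() for n in range(10)]

raw = sum(len(image) for image in vm_images)
deduped = unique_bytes(vm_images)
print(raw, deduped)  # 400 112
```

Here the shared application data is counted once rather than 10
times, so the volume to send approaches that of a single machine plus
the per-VM differences.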
Between source-based and target-based deduplication, which would
be preferred?
Client or source-based
deduplication is preferred, since it reduces both network
requirements and storage costs. Deduplication at the source reduces
the amount of data that needs to be transmitted across the network,
thereby reducing bandwidth requirements. It also reduces the amount
of data going to the cloud storage, thereby reducing storage
requirements. This brings down the overall storage and bandwidth costs.
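The bandwidth saving of source-based deduplication can be sketched as
follows. The client hashes its chunks, asks the target which digests
it does not yet hold, and sends only those chunks. The Target class,
its missing and put methods, and the function source_side_backup are
hypothetical names for this illustration, not the interface of any
particular product, and the small cost of sending the digests
themselves is ignored.

```python
import hashlib


def split(data, size=4):
    """Cut data into fixed-size chunks (size is an assumption)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class Target:
    """Stand-in for the backup target (for example, cloud storage)."""

    def __init__(self):
        self.store = {}  # hex digest -> chunk bytes

    def missing(self, digests):
        """Return the digests the target does not hold yet."""
        return [d for d in digests if d not in self.store]

    def put(self, digest, chunk):
        self.store[digest] = chunk


def source_side_backup(data, target):
    """Send only chunks the target lacks; return payload bytes sent."""
    by_digest = {hashlib.sha256(c).hexdigest(): c for c in split(data)}
    sent = 0
    for digest in target.missing(list(by_digest)):
        target.put(digest, by_digest[digest])
        sent += len(by_digest[digest])
    return sent


target = Target()
sent_first = source_side_backup(b"AAAABBBBAAAACCCC", target)
sent_second = source_side_backup(b"AAAABBBBAAAADDDD", target)
print(sent_first, sent_second)  # 12 4
```

Even the first backup sends less than the full 16 bytes because the
repeated chunk is transmitted once, and the second backup sends only
the single new chunk.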
It also allows organizations to back up data from remote offices or
sites that are connected to the data center over very low-bandwidth
links. In addition, it makes a higher degree of encryption
practical, since more bandwidth is left available and there is less
actual data to protect.
However, deduplication at the source is sometimes difficult because
it is a compute-intensive process and the client device may not have
enough processing capacity. In that case, deduplication can be done
at the media server.
While a little more bandwidth is consumed in sending data with
redundancies from the client to the media server, the data is
deduplicated there before being sent on to the server, thereby
still keeping the overall bandwidth and storage requirements in
check.
How do you see the backup and recovery scenario
evolving?
One significant trend that is becoming evident in backup and
recovery is a shift from data protection to
information management. We believe that the last 10 years have been
data protection-centric. The next 10 years will be about data
availability and management. Backup administrators will get
requests not for restoring files (as in the past) but for restoring,
for instance, the last 10 transactions in a database related to a
particular set of orders.