A lightning strike in Dublin, Ireland, knocked Amazon's European
cloud services offline Sunday, and some customers were expected to
be down for up to two days.
Amazon customers would have needed disaster recovery capability
with live data replication already in place to avoid being caught
in the outage.
Both primary and secondary power supplies were knocked out in
the same lightning strike during an intense electrical storm Sunday
over the city of Dublin, where Amazon operates its European zone
data center. The strike caused a transformer explosion and fire in
the grid of Amazon's electricity supplier; the same strike also
knocked out Amazon's backup generators.
An "electric deviation" caused by the strike traveled along the
power feed wires to knock out the control system that would
normally have triggered backup generators in the data center,
Amazon operators reported in the EC2 cloud's Service Health
Dashboard for European users.
Amazon and other data center operators take precautions to
protect against lightning strikes, said Indu Kodukula, CTO of
SunGard Availability Services, a disaster recovery specialist firm.
But a direct strike on the power supplier's transformer "is a thing
you pray never happens to you," he noted.
The strike also affected a Microsoft data center powering its
Business Productivity Online Suite of applications, according to
DataCenterKnowledge.com, a data center operations site.
Amazon itself explained on its Service Health Dashboard:
"Normally, upon dropping the utility power provided by the
transformer, electrical load would be seamlessly picked up by
backup generators. The transient electric deviation caused by the
explosion was large enough that it propagated to a portion of the
phase control system that synchronizes the backup generator plant,
disabling some of them."
In response to InformationWeek inquiries, Amazon Web Services
said, "We are planning to publish a post mortem with more details,"
much as it did after a network misconfiguration brought down several EC2
services in its northern Virginia data center over April's Easter
weekend.
To avoid being caught in the European outage, Amazon customers
would have had to take extraordinary measures to protect themselves
before the incident occurred, said Kodukula.
It's still possible that having the ability to fail over to a
second availability zone within the data center would have saved a
customer's system. Availability zones within an Amazon data center
typically have different sources of power and telecommunications,
allowing one to fail and others to pick up parts of its load. But
not everyone has signed up for failover service to a second zone,
and Amazon spokesman Drew Herdener declined to say whether
secondary zones remained available in Dublin after the primary zone
outage.
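The failover decision described above can be sketched in a few lines of Python. This is a minimal illustration, not Amazon's mechanism: the zone names, the health map, and the function choose_zone are all hypothetical.

```python
from typing import Dict, List, Optional


def choose_zone(healthy: Dict[str, bool], preference: List[str]) -> Optional[str]:
    """Return the first zone in preference order that reports healthy."""
    for zone in preference:
        if healthy.get(zone, False):
            return zone
    return None  # no zone available: failover alone cannot help


# Hypothetical zone names; "zone-a" stands in for the primary zone.
preference = ["zone-a", "zone-b"]

# Primary down, secondary up: the workload moves to the secondary.
after_strike = choose_zone({"zone-a": False, "zone-b": True}, preference)

# Primary down and no second zone ever set up: nothing to fail over to.
worst_case = choose_zone({"zone-a": False}, preference)
```

In this sketch a customer without a second zone gets no answer at all, which is the position of those who had not signed up for failover.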
In the April outage in Amazon's U.S. East region, cloud services in
secondary zones failed after the primary zone went down, triggering
"a re-mirroring storm." In such an incident, the sudden loss of
access to many users' data causes automated systems to try to
duplicate the data elsewhere, tying up all available resources.
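A toy model shows why such a storm ties up resources. It assumes, purely for illustration, that every volume that lost its copy claims one spare slot and that the number of spare slots is fixed; the function remirror and all figures are hypothetical and do not describe Amazon's actual storage system.

```python
def remirror(volumes_without_mirror: int, spare_slots: int):
    """Each volume that lost its copy claims one spare slot, if any remain.

    Returns (re-mirrored, still waiting, spare slots left).
    """
    placed = min(volumes_without_mirror, spare_slots)
    return placed, volumes_without_mirror - placed, spare_slots - placed


# Hypothetical capacity: 100 spare slots in the surviving zones.
routine = remirror(volumes_without_mirror=5, spare_slots=100)
storm = remirror(volumes_without_mirror=400, spare_slots=100)
```

With a handful of failures, plenty of capacity is left over. When hundreds of volumes look for a new home at once, the spare slots run out and the remaining volumes are left waiting.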
Some companies now employ a form of disaster recovery that
stores a duplicate set of virtual machines at a separate site;
they're started up in the event of failure at the primary site. But
Kodukula said such a process takes several minutes to get systems
started at an alternative site. It also results in the loss of several
minutes' worth of data.
Another alternative is to set up a data replication system to
feed real-time data into the second site. If systems there are kept
running continuously, they can pick up the work of the failed
systems with a minimum of data loss, he said. But companies need
their own coordination expertise to make such a system work, and
some data may still be lost.
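The difference in data loss between the two approaches can be worked through with a short Python sketch. All figures here are assumptions chosen for illustration: one write per second, a stored copy refreshed every 300 seconds, and a live feed that trails the primary site by 2 seconds. Neither number comes from Kodukula or Amazon.

```python
def writes_lost(write_times, failure_time, last_copied_time):
    """Count writes made before the failure that never reached the second site."""
    return sum(1 for t in write_times if last_copied_time < t <= failure_time)


def last_snapshot(failure_time, interval):
    """Time of the most recent periodic copy taken at or before the failure."""
    return (failure_time // interval) * interval


write_times = range(1, 1001)  # hypothetical load: one write per second
failure_time = 1000           # primary site fails at t = 1000 seconds

# Stored duplicate refreshed every 300 seconds (assumed interval).
stored_copy_loss = writes_lost(
    write_times, failure_time, last_snapshot(failure_time, 300)
)

# Live replication trailing the primary by 2 seconds (assumed lag).
live_feed_loss = writes_lost(write_times, failure_time, failure_time - 2)
```

Under these assumptions the stored copy loses 100 writes and the live feed loses 2, which matches the pattern described: minutes of data in one case, a minimum but not zero in the other.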
SunGard and other parties are known to be working on specialized
services in the cloud that will ease the problem of establishing
backup systems and activating them in case of failure. But no such
services have been announced yet.
Source: InformationWeek USA