
Sunday, February 24, 2008

Data Center Outage - Part 2

****Originally Published 6/14/07

Unbelievably, this happened again today, under slightly different circumstances.

We took a lot of lessons learned from the first experience and created a list of action
items. Two of those action items were installing an amp meter on the feed from the generator
to the building, so we could verify we were getting power and hadn't blown a circuit, and
increasing the UPS runtime to 30 minutes. We installed the amp meter and completed a thorough
test and inspection of the generator on Saturday. For some reason – and several were given,
but none truly could support the decision – we decided to upgrade the UPS today: mid-week,
during business hours. I know, you're already sensing the result.

The installation team promised, certified, guaranteed and swore that there would be no power
disruption. They put the UPS in bypass mode so we would receive direct commercial power,
instead of passing the power through the UPS. Everything started around 11 AM. At 11:19 AM,
as they were installing the first strip of UPS (and, incidentally, half of the IT staff was
off to a farewell lunch for an employee), the system “arced” and blew a fuse.

We lost power for exactly one minute – and believe me, the people who made the decisions
repeated the "only one minute" theme multiple times – but it was enough to take down all of
our servers and put us right back into the recovery process from our prior outage.

A few differences this time, proving we learned some lessons but needed to re-learn
others:

Though leadership remained calm and methodical, patience and understanding were gone. Two
full DC outages in 12 days resulted in some people being called on the carpet quite
publicly, albeit subtly.

The application managers were much less forgiving this time around, and there were many
"whispered" comments. You only get one "OOOPS!" in technology – and that's if you're lucky
AND have done everything correctly.

We took copious notes, documented everything from last time, and created a list of action
items – but in just 12 days, we hadn't completed most of them, formatted the notes, or
printed them so we had hard copies. Luckily, most were on local hard drives and on our
company portal, which was one of the first sites back up.

We halved the downtime. In the first outage, we were down roughly 5 hours. This time, we had
critical apps back in 45 minutes, all apps back in 2.5 hours.

Ironically, we still plan to hold our FORMAL DR test this weekend and Mon – Wed of next
week. (Thanks to a botched MOSS upgrade – which I need to document – I've now done this 3
times for my critical app, so I'm exempt from the DR testing.)

With the last outage, I truly looked at it as a series of unfortunate events. This time, I
anticipate casualties. To what extent, who knows, but we are going through a merger, and
there is redundancy.

I share because while we learn best from our own mistakes, we learn faster by avoiding
others’.

Data Center Meltdown

***Originally Posted 6/1/07

The fun for today was a complete power outage at our data center. As many of you know, a
data center often acts as the central location for providing network services. Ours is no
different, except that it provides those services for corporate-based applications (email,
voice mail, accounting, etc.); applications based at our mining locations typically have
servers hosted on site. When the DC goes offline, corporate cannot work.

A rough, unofficial timeline of events is as follows:

9:20 AM - City power drops at data center
9:20 AM - UPS kicks on
9:21 AM - Backup generator kicks on
9:35 AM - With generator running, entire data center goes offline
9:45 AM - Emergency meeting convened for all IT
9:50 AM - War room established
9:55 AM - Realize that DR phone bridge is non-functional, create new phone bridge
10:00 AM - Determine generator blew a circuit, preventing power from reaching data center
1:30 PM - All critical systems restored.
3:00 PM - All systems finally restored.

The events that transpired left me with a number of observations:

1) Our DR plan was incomplete. The systems crashed during month-end close, which made
certain applications absolutely critical to restore; if the crash had occurred mid-month,
different applications would have had a higher priority. We did not have this documented.
Additionally, certain applications spanning multiple servers required that the servers come
back online in a specific order. This, too, was not documented. Finally, many people did not
have the full list of servers that needed to be restored for their applications to function.
(A rough sketch of the kind of runbook we were missing follows these observations.)

2) Smart, dedicated people can rapidly compensate for a lack of planning. To recover from a
complete loss of power at the DC with only a five-hour system outage is an impressive feat.
I believe two factors contributed to this success: Leadership and Knowledge. Immediately,
leadership convened the right people and, more importantly, set the right priorities –
identify the key applications, identify the servers required to restore those applications,
involve the right people, and communicate the correct message to the correct stakeholders.
Individual players then acted on the priorities and filled in the blanks based on their
knowledge.

3) Sometimes, it is possible to do everything right and still have something go wrong.
Every control put in place did its job, including the switch shutting off when the generator
came on. The switch received a larger current than it could handle and, per design, shut
off. The question as I write this is whether we had the correct switch in place, or whether
the generator created a larger-than-expected surge. The one thing we know to do now is
monitor the power coming from the generator so we can take corrective action (i.e., flip the
circuit back) should we not receive current.
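
To make observation 1 concrete: what we lacked was a written runbook tying each application
to its servers, the order those servers must boot, and how priorities shift during month-end
close. Here is a minimal, hypothetical sketch of that kind of document expressed in Python –
every application name, server name, and month-end cutoff below is invented for illustration,
not taken from our actual environment.

# Hypothetical restore runbook: which servers each application needs, the order
# they must come up in, and how priority shifts during month-end close.
from datetime import date

# Each entry: application -> priority outside month-end, priority during close,
# and servers listed in the order they must be powered on.
RUNBOOK = {
    "email":      {"normal": 1, "month_end": 2, "servers": ["dc-ad01", "mail01", "mail02"]},
    "accounting": {"normal": 3, "month_end": 1, "servers": ["dc-ad01", "sql01", "erp-app01"]},
    "voicemail":  {"normal": 2, "month_end": 3, "servers": ["pbx01", "vm01"]},
}

def restore_plan(today: date) -> list[tuple[str, list[str]]]:
    """Return applications in restore-priority order, each with its server boot order."""
    # Crude rule for illustration: treat day 25 onward as month-end close.
    key = "month_end" if today.day >= 25 else "normal"
    ordered = sorted(RUNBOOK.items(), key=lambda item: item[1][key])
    return [(app, info["servers"]) for app, info in ordered]

if __name__ == "__main__":
    for app, servers in restore_plan(date.today()):
        print(f"{app}: bring up {' -> '.join(servers)}")

Even a one-page printed table taped up in the war room would serve the same purpose; the
point is that the priorities and boot order live somewhere other than in people's heads.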

The next few weeks will determine how well we actually did. There are still questions of
data integrity, and we have yet to experience whatever the long-term implications are, but
this was as successful as we could hope for. Ironically, we were planning to perform our
first DR test in two weeks. We did the real thing today.