Sunday, February 24, 2008

Data Center Outage - Part 2

Originally Published 6/14/07

Unbelievably, this happened again today, under slightly different circumstances.

We took a lot of lessons learned from the first experience and created a list of action
items. Two of those action items were installing an amp meter between the generator and the
building, so we could verify we were getting power and hadn't blown a circuit, and increasing
the UPS runtime to 30 minutes. We installed the amp meter and completed a thorough test and
inspection of the generator on Saturday. For some reason – and several were given, but none
truly could support the decision – we decided to upgrade the UPS today: mid-week, during
business hours. I know, you're already sensing the result.

The installation team promised, certified, guaranteed and swore that there would be no power
disruption. They put the UPS in bypass mode so we would receive direct commercial power,
instead of passing the power through the UPS. Everything started around 11 AM. At 11:19 AM,
as they were installing the first strip of UPS (and, incidentally, half of the IT staff was
off to a farewell lunch for an employee), the system “arced” and blew a fuse.

We lost power for exactly one minute – and believe me, the people who made the decisions
repeated the "only one minute" theme multiple times – but it was enough to take down all of
our servers and put us right back into the recovery process from our prior outage.

A few differences this time, proving we learned some lessons but needed to re-learn
others:

Though leadership remained calm and methodical, patience and understanding were gone. Two
full DC outages in 12 days resulted in some people being called on the carpet quite
publicly, albeit subtly.

The application managers were much less forgiving this time around, and there were many
"whispered" comments. You only get one "OOOPS!" in technology – and that's if you're lucky
AND have done everything correctly.

We took copious notes, documented everything from last time, and created a list of action
items – but in just 12 days, we hadn't completed most of them, formatted the notes, or
printed them so we had hard copies. Luckily, most were on local hard drives and on our
company portal, which was one of the first sites back up.

We halved the downtime. In the first outage, we were down roughly 5 hours. This time, we had
critical apps back in 45 minutes, all apps back in 2.5 hours.

Ironically, we still plan to hold our FORMAL DR test this weekend and Mon – Wed of next
week. (Thanks to a botched MOSS upgrade - which I need to document - I've now done this 3
times for my critical app, so I'm exempt from the DR testing.)

With the last outage, I truly looked at it as a series of unfortunate events. This time, I
anticipate casualties. To what extent, who knows, but we are going through a merger, and
there is redundancy.

I share because while we learn best from our own mistakes, we learn faster by avoiding
others’.
