Sunday, February 24, 2008

Data Center Meltdown

***Originally Posted 6/1/07

The fun for today was a complete power outage at our data center. As many of you know, a
data center often acts as the central location for providing network services. Ours is no
different, except that it is the central location for our corporate applications (email,
voice mail, accounting, etc.). Applications based at our mining locations typically have
servers hosted on site. When the DC goes offline, corporate cannot work.

A rough, unofficial timeline of events is as follows:

9:20 AM - City power drops at data center
9:20 AM - UPS kicks on
9:21 AM - Backup generator kicks on
9:35 AM - With generator running, entire data center goes offline
9:45 AM - Emergency meeting convened for all IT
9:50 AM - War room established
9:55 AM - Realize that DR phone bridge is non-functional, create new phone bridge
10:00 AM - Determine the generator tripped a breaker, preventing power from reaching the data center
1:30 PM - All critical systems restored
3:00 PM - All systems finally restored

The events that transpired left me with a number of observations:

1) Our DR plan was incomplete. The systems crashed during month-end close, which made
certain applications absolutely critical to restore. If the crash had occurred mid-month,
different applications would have had a higher priority. We did not have this documented.
Additionally, certain applications spanning multiple servers required that those servers
come back online in a specific order. This, too, was not documented. Finally, many people
did not have a full list of the servers that had to be restored before their applications
would function.
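
For illustration only, here is a minimal sketch of the kind of restore-order documentation we
were missing, assuming hypothetical application names, server names, and a crude month-end
rule. Something this simple, kept current, would have answered most of the questions we
scrambled to answer in the war room.

    # Minimal sketch of a documented restore plan. Application names, server
    # names, and the month-end rule are hypothetical placeholders.
    from datetime import date

    # Each application lists the servers it needs, in the order they must boot.
    RESTORE_PLAN = {
        "accounting": {
            "servers": ["db-01", "app-01", "web-01"],  # boot order matters
            "priority_month_end": 1,                   # critical during close
            "priority_normal": 3,
        },
        "email": {
            "servers": ["mail-db-01", "mail-01"],
            "priority_month_end": 2,
            "priority_normal": 1,
        },
        "voicemail": {
            "servers": ["vm-01"],
            "priority_month_end": 3,
            "priority_normal": 2,
        },
    }

    def restore_checklist(today: date) -> list[tuple[str, list[str]]]:
        """Return (application, ordered server list) pairs, most critical first."""
        month_end = today.day >= 25  # crude stand-in for "during month-end close"
        key = "priority_month_end" if month_end else "priority_normal"
        apps = sorted(RESTORE_PLAN.items(), key=lambda item: item[1][key])
        return [(name, spec["servers"]) for name, spec in apps]

    if __name__ == "__main__":
        for app, servers in restore_checklist(date.today()):
            print(f"{app}: bring up {' -> '.join(servers)}")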

2) Smart, dedicated people can rapidly compensate for a lack of planning. Recovering from a
complete loss of power at the DC with only a five-hour outage is an impressive feat. I
believe two factors contributed to this success: Leadership and Knowledge. Immediately,
leadership convened the right people and, more importantly, set the right priorities:
identify the key applications, identify the servers required to restore those applications,
involve the right people, and communicate the correct message to the correct stakeholders.
Individual players then acted on those priorities and filled in the blanks based on their
knowledge.

3) Sometimes it is possible to do everything right and still have something go wrong. Every
control we put in place did its job, including the breaker tripping when the generator came
on. The breaker received a larger current than it could handle and, per design, shut off.
The open question as I write this is whether we had the correctly rated breaker in place, or
whether the generator created a larger surge than expected. The one thing we know to do now
is monitor the power coming from the generator so we can take corrective action (i.e., flip
the breaker back) should we stop receiving current.
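
As a back-of-the-napkin sketch of what that monitoring could look like: poll the generator
feed and page someone the moment output drops, so the breaker can be checked and reset. The
meter hook, voltage threshold, and alerting below are placeholders, not anything we actually
have in place.

    import time

    NOMINAL_VOLTS = 480.0  # assumed generator output; ours may differ
    ALARM_RATIO = 0.5      # alarm if output falls below half of nominal
    POLL_SECONDS = 10

    def read_generator_voltage() -> float:
        """Placeholder for whatever metering interface is actually available."""
        return 480.0  # stand-in reading; replace with a real meter query

    def alert(message: str) -> None:
        """Placeholder for paging the on-call facilities/IT staff."""
        print(f"ALERT: {message}")

    def watch_generator_feed() -> None:
        """Poll the generator feed and raise an alarm if power stops arriving,
        so someone can check the breaker and flip it back."""
        while True:
            volts = read_generator_voltage()
            if volts < NOMINAL_VOLTS * ALARM_RATIO:
                alert(f"Generator feed at {volts:.0f} V - check the breaker")
            time.sleep(POLL_SECONDS)

    if __name__ == "__main__":
        watch_generator_feed()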

The next few weeks will determine how well we actually did. There are still questions about
data integrity, and the long-term implications have yet to surface, but this was as
successful an outcome as we could have hoped for. Ironically, we had been planning to
perform our first DR test in two weeks. We did the real thing today.
