Sunday, February 24, 2008

Data Center Outage - Part 2

***Originally Published 6/14/07

Unbelievably, this happened again today, under slightly different circumstances.

We took a lot of lessons learned from the first experience and created a list of action items. Two of those action items were installing an amp meter between the generator and the building, so we could verify we were getting power and hadn't blown a circuit, and increasing the UPS runtime to 30 minutes. We installed the amp meter and completed a thorough test and inspection of the generator on Saturday. For some reason – and several were given, but none truly could support the decision – we decided to upgrade the UPS today: mid-week, during business hours. I know, you're already sensing the result.

The installation team promised, certified, guaranteed and swore that there would be no power
disruption. They put the UPS in bypass mode so we would receive direct commercial power,
instead of passing the power through the UPS. Everything started around 11 AM. At 11:19 AM,
as they were installing the first strip of UPS (and, incidentally, half of the IT staff was
off to a farewell lunch for an employee), the system “arced” and blew a fuse.

We lost power for exactly one minute – and believe me, the people who made the decisions repeated the “only one minute” theme multiple times – but it was enough to bring down all of our servers and put us right back into the recovery process from our prior outage.

A few differences this time, proving we learned some lessons but needed to re-learn others:

Though leadership remained calm and methodical, patience and understanding were gone. Two full DC outages in 12 days resulted in some people being called on the carpet quite publicly, albeit subtly.

The application managers were much less forgiving this time around, and there were many “whispered” comments. You only get one “OOOPS!” in technology – and that's if you're lucky AND have done everything correctly.

We took copious notes, documented everything from last time, and created a list of action items – but in just 12 days, we hadn't completed most of them, formatted the notes, or printed them so we had hard copies. Luckily, most were on local hard drives and on our company portal, which was one of the first sites back up.

We halved the downtime. In the first outage, we were down roughly 5 hours. This time, we had
critical apps back in 45 minutes, all apps back in 2.5 hours.

Ironically, we still plan to hold our FORMAL DR test this weekend and Mon – Wed of next
week. (Thanks to a botched MOSS upgrade - which I need to document - I’ve now done this 3
times for my critical app, so am exempt from the DR testing).

With the last outage, I truly looked at it as a series of unfortunate events. This time, I
anticipate casualties. To what extent, who knows, but we are going through a merger, and
there is redundancy.

I share because while we learn best from our own mistakes, we learn faster by avoiding
others’.

Data Center Meltdown

***Originally Posted 6/1/07

The fun for today was a complete power outage at our data center. As many of you know, a data center often acts as the central location for providing network services. Ours is no different, except that it provides those services for corporate-based applications (email, voice mail, accounting, etc.). Applications based at our mining locations typically have servers hosted at their location. When the DC goes offline, corporate cannot work.

A rough, unofficial timeline of events is as follows:

9:20 AM - City power drops at data center
9:20 AM - UPS kicks on
9:21 AM - Backup generator kicks on
9:35 AM - With generator running, entire data center goes offline
9:45 AM - Emergency meeting convened for all IT
9:50 AM - War room established
9:55 AM - Realize that DR phone bridge is non-functional, create new phone bridge
10:00 AM - Determine generator blew a circuit, preventing power from reaching data center
1:30 PM - All critical systems restored.
3:00 PM - All systems finally restored.

The events that transpired left me with a number of observations:

1) Our DR plan was incomplete. The systems crashed during month end close, which made
certain applications absolutely critical for restore. If the crash had occurred mid-month,
different applications would have had a higher priority. We did not have this documented.
Additionally, certain applications on multiple servers required that the servers come back
online in a specific order. This, too, was not documented. Finally, many people did not have
the full list of servers they needed restored to have functioning applications.
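Purely to make that gap concrete, here is a rough sketch of the kind of restore documentation we were missing. The application and server names are invented for illustration; this is not our actual environment.

    # Hypothetical sketch: restore priority by business period, plus the
    # boot order each application needs. All names below are made up.

    RESTORE_PRIORITY = {
        "month_end_close": ["general_ledger", "email", "payroll"],
        "mid_month":       ["email", "payroll", "general_ledger"],
    }

    SERVER_ORDER = {
        "general_ledger": ["gl-db-01", "gl-app-01", "gl-web-01"],
        "email":          ["ad-dc-01", "mail-db-01", "mail-web-01"],
        "payroll":        ["pay-db-01", "pay-app-01"],
    }

    def restore_sequence(period):
        """Full, de-duplicated server boot order for a given business period."""
        sequence = []
        for app in RESTORE_PRIORITY[period]:
            for server in SERVER_ORDER[app]:
                if server not in sequence:
                    sequence.append(server)
        return sequence

    print(restore_sequence("month_end_close"))

Even a one-page table like this, printed and taped to the war room wall, would have answered most of the questions we spent the first hour chasing.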

2) Smart, dedicated people can rapidly compensate for lack of planning. To hold a system outage to five hours after a complete loss of power at the DC is an impressive feat. I believe two
factors contributed to this success: Leadership and Knowledge. Immediately, leadership
convened the right people and, more importantly, set the right priorities - identify the key
applications, identify the servers required to restore those applications, involve the right
people, and communicate the correct message to the correct stakeholders. Individual players
then acted on the priorities and filled in the blanks based on their knowledge.

3) Sometimes, it is possible to do everything right, and still have something go wrong.
Every control put in place did its job, including the switch shutting off when the generator
came on. The switch received a larger current than it could handle and, per design, shut
off. The question as I write this is whether we had the correct switch in place, or whether the generator created a larger-than-expected surge. The one thing we know to do now is to monitor the power coming from the generator so we can take corrective action (i.e., flip the circuit back) should we not receive current.
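We haven't actually built that monitoring yet, so treat the following purely as a sketch of the idea; read_generator_amps() is a stand-in for whatever interface the new amp meter ends up exposing, not a real API.

    # Hypothetical sketch of the generator-current watch we have in mind.
    # read_generator_amps() is a placeholder, not a real interface.
    import time

    MIN_EXPECTED_AMPS = 50.0   # made-up threshold for illustration
    POLL_SECONDS = 30

    def read_generator_amps():
        """Placeholder for the amp meter reading."""
        raise NotImplementedError

    def watch_generator(alert):
        """Poll the meter and alert when current drops below the threshold."""
        while True:
            amps = read_generator_amps()
            if amps < MIN_EXPECTED_AMPS:
                alert(f"Generator output low: {amps:.1f} A - check the breaker")
            time.sleep(POLL_SECONDS)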

The next few weeks will determine how well we actually did. There are still questions of
data integrity, and we have yet to experience whatever the long-term implications are, but
this was as successful as we could hope for. Ironically, we were planning to perform our
first DR test in two weeks. We did the real thing today.

Measuring the Right Metrics

***Originally Posted 5/4/07

As I indicated in my previous post, I've worked for a lot of companies in my 11 years in the
IT industry. I've spent most of my career as "the new guy", as is likely to happen when you
find a new home every 18 months or so. As the new guy, I've been in the best position to
recognize nonsensical processes, counterproductive policies, and disincentivizing metrics.

For example, on the surface, tracking how many calls a help desk technician can process in a
day might be an excellent way to demonstrate productivity. On the other hand, it's far
easier for that same person to answer the phone, take the name, guess at the problem, and
move on to the next call than it is to actually provide meaningful assistance. As the person
calling the help desk, would you prefer a quick phone call, or a resolution to your problem?
A more meaningful measurement is customer satisfaction, or time to resolution, or, another
common metric, average wait time after placing a call.
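If you want to see how little extra work the better metrics take, here is a toy sketch; the ticket fields and sample numbers are invented for illustration.

    # Toy sketch: time-to-resolution and satisfaction instead of call counts.
    # Ticket fields and sample data are made up.
    from dataclasses import dataclass
    from datetime import datetime, timedelta
    from statistics import mean

    @dataclass
    class Ticket:
        opened: datetime
        resolved: datetime
        satisfaction: int  # 1-5 survey score

    def avg_time_to_resolution(tickets):
        seconds = mean((t.resolved - t.opened).total_seconds() for t in tickets)
        return timedelta(seconds=seconds)

    def avg_satisfaction(tickets):
        return mean(t.satisfaction for t in tickets)

    tickets = [
        Ticket(datetime(2007, 5, 1, 9, 0), datetime(2007, 5, 1, 9, 45), 4),
        Ticket(datetime(2007, 5, 1, 10, 0), datetime(2007, 5, 2, 8, 30), 2),
    ]
    print(avg_time_to_resolution(tickets), avg_satisfaction(tickets))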

System availability is another spurious metric. Achieving 4 nines, or 5 nines (99.99 and 99.999% uptime, respectively) receives a lot of press and, again, on the surface appears a
lofty goal. In a given year, there are 8,766 hours (24 x 365.25 days). If your systems are
available 99.999% of the time, that means they are down 5 minutes per year. With mirrored
redundancy, this is accomplishable, but is it necessary and cost effective? The average
worker only works 2,000 hours a year, 40 hours a week, 8 hours a day, Monday through Friday.
If a system is unavailable from 11 PM Friday until 3 AM Monday, does it really matter? A
more realistic measurement is uptime during work hours. For most applications, we can very
easily, and very cheaply, achieve 99.999% availability during work hours. In fact, many
applications I've managed have achieved 100% uptime during work hours.
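For anyone who wants to check the arithmetic, here is the back-of-the-envelope version as a quick sketch; the 2,000 work hours figure is the same rough assumption used above.

    # Back-of-the-envelope check on the "nines" math above.
    HOURS_PER_YEAR = 24 * 365.25    # 8,766 hours
    WORK_HOURS_PER_YEAR = 2000      # ~40 hours/week, 50 weeks

    def downtime_minutes(availability, hours):
        """Allowed downtime, in minutes per year, for an availability target."""
        return (1 - availability) * hours * 60

    for target in (0.9999, 0.99999):
        print(f"{target:.3%} of the calendar year -> "
              f"{downtime_minutes(target, HOURS_PER_YEAR):.1f} min down/year")
        print(f"{target:.3%} of work hours only  -> "
              f"{downtime_minutes(target, WORK_HOURS_PER_YEAR):.1f} min down/year")

Five nines against the calendar comes out to roughly 5.3 minutes a year; five nines against work hours is barely more than a minute, which is why chasing the calendar number is the expensive part.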

Finally, after the debacle of the early 2000's, where IT spend increased rapidly and business value decreased even faster, our industry began applying metrics to projects. Once again, do we track the right metrics, and, more importantly, do we place the appropriate emphasis on them? Three metrics any competent project manager will track during the course of a project are cost, timing, and deliverables. Basically, is the project team delivering the defined deliverables on time and on budget? But where do we track if the project
delivered to expectation? Where do we validate that, while technically functional, the
system meets the wants and needs of the user community? I can deliver an application that
has every field a user will ever enter on one page, and that will often meet the "technical"
requirements, but no one will want to use it. Placing the emphasis on the triple
constraints, exclusively, dilutes the value of the project by focusing effort on the wrong
areas.

Technology is at a crossroads. As an industry, we are suffering the sins of our forefathers: those well-intentioned geeks who believed the myth that technology could solve every problem, if given enough time and enough money. Their misguided belief has resulted in leaders of finance, who view the world through "carcass value" lenses, controlling an environment driven by innovation, risk, and creativity.

The true leaders of today's technology landscape recognize that we provide business value
not by taking orders, cutting costs, and keeping the lights on, but by understanding the
goals of the business, partnering with them, and introducing innovation to help achieve
those goals. Technology professionals earn their high salaries by solving problems the
business can't, not by automating their solutions.

We entered this profession to solve complex problems. Many of us thought those problems
would be life changing and world altering - and some of them are. But the more common
problems, the more abundant problems, the more important problems to our employers, are the problems they face every day that we can solve with two weeks of work. Where are the metrics
for that?

Introduction

I originally started a personal blog, but found I was posting too many entries about my work. Many of my work postings started as emails, and colleagues began asking me to forward those emails. Since I've always been one to separate my personal life from my professional life, I decided to create this blog to post my work-related stories, challenges, and resolutions.

What you'll find:

  • Honest, from-the-gut rants about working in the IT industry.
  • The good, the bad and the political of trying to move up in management.
  • True testimonials of work challenges, failures and successes.

Although my identity may be known, I do not intend to provide names or call out colleagues, peers, superiors, subordinates, customers or vendors. My intention is to provide information that allows the community at large to learn from my and my organizations' errors and accomplishments, not to damage any individual's or organization's reputation. I hope you find my blog useful, entertaining, and pertinent.