Friday, February 29, 2008

Documents Please!

I had an interesting discussion with one of our architects today that I've had on several prior occasions with Business Analysts, Managers, Developers and Business Users. The conversation centered on how to properly document the business and technical requirements for a project and an application. Every organization I've worked for has had its own documentation requirements, all were very proud of what they had, and every employee preferred what they had used at their prior organization.

I derive my perspective on documentation from a few basic principles:

1) The business users are the customers - the reasons for developing the application. The more difficult the document is for them to understand, the less value the document adds to the end result.

2) You must be able to trace a business requirement from definition through design. If you are unable to trace each requirement through to design, or you cannot trace a design element back to a requirement, your documents add no value.

3) Each step in the documentation process should add incremental value. If you are simply restating the same information in a different format, then you do not add value.

4) Speed to market is king. Do not create documentation for the sake of documentation, and recognize that the level of documentation required varies with the size and complexity of the project.

5) Technical documentation is useful in building an application but adds no value in supporting an application. Business documentation - process flows, requirements, etc. - adds value to both.

6) Format is less important than clarity. A project team should use whatever format adds clarity for everyone while still adhering to the basic tenets of the SDLC - define it, design it, build it, test it, deploy it.

The discussion today centered on items 1 and 5. For projects I've managed and successfully delivered, I've chosen to limit the product documentation to:

  • Business Requirements
  • Process Flows (typically Swim Lanes)
  • Screen Shots

If the project is large and/or complex, I add:

  • Database Diagram (ERD)
  • Class Definitions

While I see the value in Use Cases, UML and Sequence Diagrams when building applications where errors mean death (e.g., NASA, aeronautics) or major financial risk, I've found that documenting to that level of detail for a custom, corporate application adds unnecessary overhead to a project and adds little to no value going forward.

As far as technical documents post production, I pose this question:

If you are troubleshooting a problem, and you can only choose one, would you prefer access to the source code in debug mode, or access to the application documentation?

The only person I've ever posed this question to who has chosen the latter is the person I spoke with today. That doesn't make him wrong, just unique in my experience.

How do you document projects?

Sunday, February 24, 2008

The problem with contractors

Like most technology managers, I often need the services of outside parties to backfill for people on leave, provide overflow support on projects, or to provide short-term expertise in areas where my team has none. Contractors, which I differentiate from consultants, have played integral roles in a number of key projects for me over the years. Typically, they are bright, motivated, and professional, knowing that without a permanent gig, their reputation determines whether they get the next job.

Recently, I had the worst experience I've ever had with a contractor. So bad, in fact, I
alerted every agency I've worked with that sending this individual out as a representative
of their organization would only diminish their reputation. I've never felt compelled to do
that in any prior experience - even with people I've fired for non-performance.

For those interested in the full story, read the details below. The short story is that the contractor was in over his head from the beginning and didn't recognize it. I checked in every day - and even asked to see the work - to gauge how it was progressing, and I always left assured that everything was on track. When he delivered it to me for testing, it was an embarrassment. We finally got the system working at 1:30 AM, two days before the delivery date, and delivered it for testing. The system went live on time and functioned as required, so the business users never felt the impact - though my team certainly did.
Then it got worse.

After I let the contractor go and informed the agency of the challenges, he started spamming me and my superiors with emails slandering the intelligence and professionalism of me and my staff. He claimed the people who made his code actually function had sabotaged his efforts. It got so bad his agency had to threaten him with legal and police action before he would stop harassing me, my staff, and my management.
Suffice it to say, I hope he picks a new industry to work in, and I pity any organization unfortunate enough to hire him.

Lessons Learned:

1) Never push a high-profile, high-urgency project to a contractor. Bring them in to backfill for the staff who will do the project.
2) Good partners can still provide bad service, and you need to be prepared for that.
3) Involving senior people at the beginning (which I did) is imperative if you need to call
on them to save you at the end. Their familiarity with the project will save valuable time.

THE FULL STORY:

The Background

One of our business units poorly planned the roll-out of a new process they needed
implemented by December 1st and contacted me on November 15th to request an automated
solution. I took their requirements, worked with my team to map out a solution, and provided
a realistic estimate and timeline - December 3rd for testing. They reiterated production
release by December 1, and threw the weight of our Chief Administrative Officer behind the
request, so we now had a new delivery date of November 26th to allow for testing.

I called a firm with which I have had repeated success (and continued success after this incident), provided them the technical requirements, and asked them to have a candidate to me at the contracted rate the next day for a quick phone screen, with a start date two days later - the Friday before Thanksgiving. In the phone screen, I explained the project, asked a few basic questions regarding past projects, technical strengths, and dealing with pressure situations, and agreed to hire him for the job.

Trouble Early

The day he began, I explained the project, went through the detailed design, introduced two
resources at his disposal for assistance and questions (one technical, one a PM), and then
left him to work. I had off-site meetings the week of Thanksgiving, so I was checking in by
phone at the end of each day. On Monday (day 2 of the project), my PM informed me that he
had to explain to the contractor how to work with data grids and, at one point, sat down and
wrote the initial code for him to get the grids working. I spoke with the contractor who
indicated he had initial issues, but thanks to the PM was now through them and was confident
in hitting the date. On Wednesday, before leaving for Thanksgiving, I checked in once again
and everything was "complete" except for one SQL routine for determining hierarchy. We
agreed I would get the code to test on Monday afternoon, when we returned from the holidays.
On Monday at noon, the SQL routine was not done. The contractor said it would be a few
hours. At the end of the day, I said he needed to stay until it was complete and that
another member of my staff would stay with him to assist. At 9 PM, I received a link to
begin testing.

The Test

Monday evening, I tested what he called "complete" and found an issue with the first user
account - I couldn't log in and received an ugly .NET error. I tried another user - same
issue. The third user finally could access the system, but there were more issues. In all, I
found 14 issues on a small project that had two screens - a login screen and a data entry
screen with nested data grids. Many issues were basic - the password displayed in clear text instead of being masked with asterisks, users couldn't log in, the footer was at the top of the screen instead of the bottom.

With four days left and a non-functioning product, I asked to borrow an architect from
another team to help close out the project. I called a meeting with the architect, the
contractor, and a junior member on my staff who was doing production support and reviewed
the issues. We devised a plan to resolve them and they estimated a delivery of Tuesday
afternoon.

Situation Worsens

At 2 PM, the architect informed me that the contractor seemed clueless. He had named the two data grids DataGrid1 and DataGrid2 - with DataGrid2 as the primary grid, and DataGrid1 the nested grid. He said he felt it would be mostly done by end of day, with some minor fixes early the next morning. On Wednesday, they told me they were finishing the last issue and I would have code by noon. At noon, the architect informed me that testing was not going well. After further investigation, we discovered that the contractor had used three different data types (float, int, and text) to represent the same field. Half of the errors we were encountering were a result of his code casting the data back and forth between the data types (more on why that hurts in the sketch below).

At 5 PM, I told everyone to plan on a long night and asked them to pick what they wanted for dinner. At 10 PM, I sent the contractor home. At 1 AM, we finally finished and had an incomplete product, but one we could hand over to the users for testing and finish the following morning. Completing it meant adding a trigger on one table to audit data changes, and a report they wouldn't use for at least a week.
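
For anyone wondering how three data types for one field turns into a pile of bugs, here is a minimal C# sketch of my own - hypothetical field names and values, not the contractor's actual code. Every consumer of the field has to guess the type and convert it, and any unexpected value blows up at runtime:

    using System;
    using System.Globalization;

    class MixedTypesDemo
    {
        static void Main()
        {
            // The same "amount" field as it came back from three places that each
            // picked a different type for it (float, int, and text).
            object amountFromGridA = 125.5f;
            object amountFromGridB = 125;
            object amountFromGridC = "125.50";

            // Every consumer now has to guess the type and convert it.
            decimal a = Convert.ToDecimal(amountFromGridA);
            decimal b = Convert.ToDecimal(amountFromGridB);

            // Parsing the text version works only until someone stores "N/A",
            // leaves it blank, or formats it for a different locale.
            decimal c = decimal.Parse((string)amountFromGridC, CultureInfo.InvariantCulture);

            Console.WriteLine(a + b + c);

            // Agree on one type (say, a decimal column everywhere) and all of this
            // conversion code - and every runtime failure it can produce - goes away.
        }
    }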

The Attack

The next morning, I informed the contractor that I could not trust him to work on the other
items I had originally hired him for. I told him that I felt he knew he needed help early
on, and didn't ask for it, and that's why I couldn't trust him. He said he disagreed but was
happy I was letting him go. I then informed his agency that he didn't cut it, and that I
felt it would be in their best interest to not sell him as a lead or put him on any solo C#
projects.

Shortly after that conversation, I received an email from the contractor extolling his own virtues - even referring me to the strength of his resume - and explaining that the architect and the member of my staff who had saved the project were actually incompetent amateurs who would only hurt my team in the long run. I thanked him for his input and wished him luck.

He then responded that I had no concept of SDLC, project management, or of development best
practices. He followed that email with one to my boss's boss, explaining that my team was
incompetent, that I hired sycophants, and that he should subject us all to testing so he
could gain a true appreciation of our incompetence.

I, of course, forwarded each of his emails to the agency who had sent him. They were
apologetic, but never offered to refund his fee. They did eventually threaten him with a
lawsuit and a visit from the police, and that seemed to resolve the issue.

What would you have done differently?

Cable Company Issues

Seven years ago, I had a mysterious issue with my cable reception that caused it to go from
crystal clear to horrible snow for 90 minutes from 5:30 to 7 PM every night. Technicians
made multiple visits and verified that, for some reason, I had reduced signal strength - but
they could not resolve it. Eventually, I gave up waiting for a resolution, and switched to
satellite.

Fast forward six years of loving my satellite system, and they do something unthinkable -
degrade picture quality. In an attempt to push more channels on the same frequency, they
double up the TV channels they send on one satellite channel, and picture quality gets so
bad that reds and flesh tones are blurry with motion and I can actually see pixelation. Once
again, I give the technicians a chance to resolve it, and even upgrade my equipment - all to
no avail.

Then, cable companies start bundling services - internet, cable, phone for under $100 a
month. Well, that's less than I'm paying by using three companies, so I decide to return to
cable. That's when I remembered why I left in the first place.

It starts out okay - phone number transfers, internet is fast and reliable, picture quality
is great - but I'm leery, so I keep satellite and pay for both, going with only basic cable.
I finally decide to make the switch in full, schedule the complete install, and the problems
begin.

First, the technician comes out to install a few more jacks, and move existing satellite
jacks to cable - and doesn't finish the job. He informs me that OSHA prevents him from going
in the attic after 10 AM because it is too hot. So, he leaves with the job unfinished,
telling me to reschedule an AM appointment. Worse, he disconnects one TV that was working
with cable and disables satellite. I spend a few hours that evening in 100 degree heat
outside repairing his mistake.

Second, I lose everything - phone, cable, internet - inexplicably for about half a day. Now I'm kicking myself for not at least keeping internet and phone with different providers. At least I hadn't disconnected the satellite yet, so I was still getting TV. I call the cable company and ask
for reimbursement. After all, I have to pay them if I'm not using it, so I want them to pay
me when I want to use it and can't. They tell me they only do refunds if the outage is for
at least 24 hours, and they won't give me one.

Not satisfied with that answer, I ask to speak to a manager. He repeats the same company line. I tell him that I left seven years ago for bad service and I'm about to do it again.
He gives me a one day refund - $6.24. Placated but not satisfied, I schedule the next
appointment for a Sunday morning at 8 AM.

Sunday morning comes, 8 AM passes, no cable guy. 8:30 AM, cable guy calls and says he'll be
late, probably around 11. I remind him that this is an attic job and he's not allowed in
after 10 AM. We discuss it and determine I'll do the little bit that needs to be done in the
attic, he'll do the rest.

1:30, and he finally arrives. We get everything set up, test to make sure I've got a good
cable feed to the rooms I need, and he leaves. I start scanning through the channels and
discover that some of them are not getting a strong enough signal. Turns out, the first guy
had spliced the cable in the wall, which distorts the signal, so now neither my digital
cable nor my new internet hookup would work. I end up recabling the two jacks and the feed
from the attic, and now everything works.

So, why would I share this story on a blog for application managers? Because in the
corporate world, application teams are the cable company, trying to deliver what the
business customer needs. When we make mistakes, they get as exasperated and frustrated with
us as I did with the cable company. I've received irate phone calls from business customers
who don't care that the server has crashed, they just know the application isn't available.
When I get those calls, I think about how I wanted the cable company to treat me when I was
having a problem and try to respond in kind.

Data Center Outage - Part 2

****Originally Published 6/14/07

Unbelievably, this happened again today, slightly different circumstances.

We took a lot of lessons learned from the first experience and created a list of action items. Two of those action items were installing an amp meter between the generator and the building – so we could verify we were getting power and hadn’t blown a circuit – and increasing the UPS capacity to 30 minutes. We installed the amp meter and completed a thorough test and inspection of the generator on Saturday. For some reason – and several were given, but none truly could support the decision – we decided to upgrade the UPS today: mid-week, during business hours. I know, you’re already sensing the result.

The installation team promised, certified, guaranteed and swore that there would be no power
disruption. They put the UPS in bypass mode so we would receive direct commercial power,
instead of passing the power through the UPS. Everything started around 11 AM. At 11:19 AM,
as they were installing the first strip of UPS (and, incidentally, half of the IT staff was
off to a farewell lunch for an employee), the system “arced” and blew a fuse.

We lost power for exactly one minute – and believe me, the people who made the decisions
repeated the “only one minute” theme multiple times – but it was enough to collapse all of
our servers and reintroduce the process from our prior outage.

A few differences this time, proving we learned some lessons but needed to re-learn others:

Though leadership remained calm and methodical, patience and understanding were gone. Two full DC outages in 12 days resulted in some people being called on the carpet quite publicly, albeit subtly.

The application managers were much less forgiving this time around, and there were many “whispered” comments. You only get one “OOOPS!” in technology – and that’s if you’re lucky, AND have done everything correctly.

We took copious notes, documented everything from last time, created a list of action items – but in just 12 days, hadn’t completed most of them, formatted the notes, or printed them so we had hard copies. Luckily, most were on local hard drives and on our company portal, which was one of the first sites back up.

We halved the downtime. In the first outage, we were down roughly 5 hours. This time, we had
critical apps back in 45 minutes, all apps back in 2.5 hours.

Ironically, we still plan to hold our FORMAL DR test this weekend and Mon – Wed of next
week. (Thanks to a botched MOSS upgrade - which I need to document - I’ve now done this 3
times for my critical app, so am exempt from the DR testing).

With the last outage, I truly looked at it as a series of unfortunate events. This time, I
anticipate casualties. To what extent, who knows, but we are going through a merger, and
there is redundancy.

I share because while we learn best from our own mistakes, we learn faster by avoiding
others’.

Data Center Meltdown

***Originally Posted 6/1/07

The fun for today was a complete power outage at our data center. As many of you know, a
data center often acts as the central location for providing network services. Our data
center is no different, other than it acts as the central location for providing network services for corporate-based applications (email, voice mail, accounting, etc.). Applications
based at our mining locations typically have servers hosted at their location. When the DC
goes offline, corporate cannot work.

A rough, unofficial timeline of events is as follows:

9:20 AM - City power drops at data center
9:20 AM - UPS kicks on
9:21 AM - Backup generator kicks on
9:35 AM - With generator running, entire data center goes offline
9:45 AM - Emergency meeting convened for all IT
9:50 AM - War room established
9:55 AM - Realize that DR phone bridge is non-functional, create new phone bridge
10:00 AM - Determine generator blew a circuit, preventing power from reaching data center
1:30 PM - All critical systems restored.
3:00 PM - All systems finally restored.

The events that transpired left me with a number of observations:

1) Our DR plan was incomplete. The systems crashed during month end close, which made
certain applications absolutely critical for restore. If the crash had occurred mid-month,
different applications would have had a higher priority. We did not have this documented.
Additionally, certain applications spanning multiple servers required that the servers come back online in a specific order. This, too, was not documented (see the sketch after these observations for one way to capture that order). Finally, many people did not have the full list of servers they needed restored to have functioning applications.

2) Smart, dedicated people can rapidly compensate for a lack of planning. To recover from a complete loss of power at the DC with only a five-hour system outage is an impressive feat. I believe two factors contributed to this success: Leadership and Knowledge. Immediately, leadership
convened the right people and, more importantly, set the right priorities - identify the key
applications, identify the servers required to restore those applications, involve the right
people, and communicate the correct message to the correct stakeholders. Individual players
then acted on the priorities and filled in the blanks based on their knowledge.

3) Sometimes, it is possible to do everything right, and still have something go wrong.
Every control put in place did its job, including the switch shutting off when the generator
came on. The switch received a larger current than it could handle and, per design, shut
off. The question as I write this is whether we had the correct switch in place, or whether the generator created a larger-than-expected surge. The one thing we know to do now is monitor the power coming from the generator so we can take corrective action (i.e., flip the circuit back) should we not receive current.
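
As promised above, here is a minimal sketch - C#, with made-up server names - of how that restore order could be captured as data instead of tribal knowledge. The dependency map is the documentation; the restore order falls out of it:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class RestoreOrderPlanner
    {
        // Hypothetical servers and dependencies - the map itself is the piece of
        // documentation we were missing during the outage.
        static readonly Dictionary<string, string[]> DependsOn = new Dictionary<string, string[]>
        {
            { "SQL-01", new string[] { } },              // database comes up first
            { "APP-01", new[] { "SQL-01" } },            // app server needs the database
            { "WEB-01", new[] { "APP-01" } },            // web front end needs the app server
            { "RPT-01", new[] { "SQL-01", "APP-01" } },  // reporting needs both
        };

        static void Main()
        {
            var restored = new List<string>();
            var remaining = new HashSet<string>(DependsOn.Keys);

            // Repeatedly restore any server whose dependencies are already up.
            while (remaining.Count > 0)
            {
                var ready = remaining.Where(s => DependsOn[s].All(restored.Contains)).ToList();
                if (ready.Count == 0)
                    throw new InvalidOperationException("Circular dependency - fix the map.");
                foreach (var server in ready)
                {
                    restored.Add(server);
                    remaining.Remove(server);
                }
            }

            Console.WriteLine("Restore order: " + string.Join(" -> ", restored));
        }
    }

The point is not the code - it is that the order lives somewhere other than in people's heads.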

The next few weeks will determine how well we actually did. There are still questions of
data integrity, and we have yet to experience whatever the long-term implications are, but
this was as successful as we could hope for. Ironically, we were planning to perform our
first DR test in two weeks. We did the real thing today.

Measuring the Right Metrics

***Originally Posted 5/4/07

As I indicated in my previous post, I've worked for a lot of companies in my 11 years in the
IT industry. I've spent most of my career as "the new guy", as is likely to happen when you
find a new home every 18 months or so. As the new guy, I've been in the best position to
recognize nonsensical processes, counterproductive policies, and disincentivizing metrics.

For example, on the surface, tracking how many calls a help desk technician can process in a
day might be an excellent way to demonstrate productivity. On the other hand, it's far
easier for that same person to answer the phone, take the name, guess at the problem, and
move on to the next call than it is to actually provide meaningful assistance. As the person
calling the help desk, would you prefer a quick phone call, or a resolution to your problem?
A more meaningful measurement is customer satisfaction, time to resolution, or another common metric: average wait time after placing a call.

System availability is another spurious metric. Achieving 4 nines or 5 nines (99.99% and 99.999% uptime, respectively) receives a lot of press and, again, on the surface appears a lofty goal. In a given year, there are 8,766 hours (24 x 365.25 days). If your systems are available 99.999% of the time, that means they are down about 5 minutes per year. With mirrored redundancy, this is achievable, but is it necessary and cost-effective? The average worker only works 2,000 hours a year: 40 hours a week, 8 hours a day, Monday through Friday. If a system is unavailable from 11 PM Friday until 3 AM Monday, does it really matter? A more realistic measurement is uptime during work hours. For most applications, we can very easily, and very cheaply, achieve 99.999% availability during work hours. In fact, many applications I've managed have achieved 100% uptime during work hours.
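
To put numbers behind that, here is a quick back-of-the-envelope calculation - my own sketch in C#, using the 8,766-hour year and 2,000-hour work year from above - showing the downtime budget each level of "nines" actually allows:

    using System;

    class DowntimeBudget
    {
        static void Main()
        {
            const double hoursPerYear = 24 * 365.25;  // 8,766 hours in a year
            const double workHoursPerYear = 2000;     // 40 hours a week, Monday through Friday

            foreach (double availability in new[] { 0.99, 0.999, 0.9999, 0.99999 })
            {
                double unavailable = 1 - availability;
                double minutesPerYear = hoursPerYear * 60 * unavailable;
                double workMinutesPerYear = workHoursPerYear * 60 * unavailable;

                Console.WriteLine(
                    "{0:P3} uptime allows {1,7:F1} minutes of downtime per year overall, " +
                    "or {2,5:F1} minutes during work hours",
                    availability, minutesPerYear, workMinutesPerYear);
            }
        }
    }

Five nines measured over the full year is a five-minute budget; five nines measured only during work hours leaves every night and weekend for maintenance, which is why it is so cheap to hit.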

Finally, after the debacle of the early 2000s, when IT spend increased rapidly and business value decreased even faster, our industry began applying metrics to projects. Once again, do we track the right metrics, and, more importantly, do we place the appropriate emphasis on them? Three metrics any competent project manager will track during the course of a project are cost, timing, and deliverables. Basically, is the project team delivering the defined deliverables on time and on budget? But where do we track whether the project delivered to expectation? Where do we validate that, while technically functional, the system meets the wants and needs of the user community? I can deliver an application that has every field a user will ever enter on one page, and that will often meet the "technical" requirements, but no one will want to use it. Placing the emphasis on the triple constraints, exclusively, dilutes the value of the project by focusing effort on the wrong areas.

Technology is at a cross-roads. As an industry, we are suffering the sins of our
forefathers: those well intentioned geeks who believed the myth that technology could solve
every problem, if given enough time and enough money. Their misguided belief has resulted in
leaders of finance, who view the world through "carcass value" lenses, controlling an
environment driven by innovation, risk, and creativity.

The true leaders of today's technology landscape recognize that we provide business value
not by taking orders, cutting costs, and keeping the lights on, but by understanding the
goals of the business, partnering with them, and introducing innovation to help achieve
those goals. Technology professionals earn their high salaries by solving problems the
business can't, not by automating their solutions.

We entered this profession to solve complex problems. Many of us thought those problems
would be life changing and world altering - and some of them are. But the more common
problems, the more abundant problems, the more important problems to our employers, are the problems they face every day that we can solve with two weeks of work. Where are the metrics
for that?

The myth of job security

****Originally Posted 11/22/06

My career at a glance:

1996
Jul - Begin working part time for a small management consulting firm with plans to go full time after graduation. I didn't seek other opportunities.
Dec - Graduate

1997
Feb - Management Consulting firm closes its doors
Mar - Work for a temp agency doing menial clerical work
Jun - Join a large national bank as a Project Management Intern on a systems conversion,
supposed to convert to full time at end of project
Oct - Project ends, told to find new job because of merger with another large, national bank
Nov - Join competitor west coast bank as an analyst, begin programming. I consider this my first real salary

1998
Dec - Join a small cooperative advertising management firm as a PM/Developer - 30% pay increase

2000
Sep - Join a technology consulting firm as a Technology PM Consultant. Placed at a direct marketing firm - 15% pay increase

2001
Throughout year - tech bubble bursts, corporate fraud (Enron, Worldcom), 9/11
Aug - Consulting firm does a Reduction In Force (RIF) because Enron left $2MM in invoices unpaid - I survive
Oct - Consulting firm reduces salaries 10%

2002
May - The direct marketing company hires me as an employee - 30% pay increase
Sep - HP and Compaq merge. HP was a major customer, so merger eliminates 40% of direct marketing firm's revenue
Oct - I buy a house close to my new employer

2003 (a rough year to get married!)
Jan - Direct marketing firm RIFs 20% of workforce - I'm let go
Jan (two weeks later) - Start work at a small trust company as a Technology PM - 10% paycut
May - The trust company hits front page news as cause of mutual fund scandal
Jul - The trust company lets go contractors
Sep - The trust company RIFs 20% of workforce - I select who leaves from technology
Oct - CEO is fired by board; OCC tells the trust company it can no longer exist
Dec - The trust company bought by another trust company, forming a new organization

2004
Feb - Promoted to Manager, Application Development, 10% raise (back to my salary at the direct marketing company)

2005
Jun - After outlasting 4 CEOs, 2 CIOs, and 3 COOs, leave the trust company for one of the Big 5 technology organizations as a Technology PM - 20% pay increase
Sep - Project I'm on stops due to disagreement between VP and SVP

2006 (a rough year to have a baby!)
Apr - After doing nothing but reading books and surfing the internet for 8 months, realize project is not going to survive
Mid-Jun - Join a Fortune 500 mining company as Manager, Innovative Solutions - 20% pay increase; Big 5 tech company kills project and lets go 30 people, VP on down
Late-Jun - Mining company announces it is buying two smaller mining companies
Sep - Smaller companies purchased by someone else
Nov - Mining company I work for announces it is being bought by a smaller mining company
Nov (2 DAYS later) - Mining behemoth announces it would like to buy the company buying us, possibly killing our merger deal (this never materialized)

In 10 years of working in Corporate America, I've worked for 11 companies in 11 different roles and quadrupled my salary. In June, I was tired of changing jobs, so I sought employment with a stable, locally based, Fortune 500 company. This mining company has existed for over 125 years and has been a staple of the state economy for over 100. Five months after I join them, they get bought.

A few things I've learned over the last 10 years:

1) There is no such thing as "job security"
2) There is such a thing as "career security"
3) Technology is a VERY good industry to be in; and the state I'm in is a great state
4) You have to be good at selling yourself if you want the pay you deserve
5) If your skills don't keep up, then you don't move up
6) Diversifying your portfolio helps you retire well; diversifying your income helps you retire early

Bottom line: I only worry about the things I can control - my skills, my attitude, my choices.