This past weekend, we had an unplanned system event which ultimately caused downtime. Prior to the system crash, there was at least two hours of advance notice that “something bad” was happening. Despite that time given, root cause was not determined then, and is very probable it will never be fully revealed. That’s too bad, because the system crash not only was disruptive to critical operations, but also extra costly because a planned fail-over to a new system — scheduled to occur the very next day — had to be aborted as well.
And there was plenty of breakdowns that justify the lost opportunity. The two primary system engineers were unreachable — one on a planned trip out of state and the other, being myself, was simply not on-call. I habitually respond to ANY event, regardless of my off-hour pager rotation duties — so I am “guilty” of being “too available”. But, in this particular case, my pager and personal BlackBerry were both out of earshot range and I was soundly asleep — probably dreaming of better days ahead in the new IBM BladeCenter. Still, I was ultimately reached just minutes after the system crashed.
Now, a capable systems engineer was on-call, and he responded immediately. However, having the least experience with this particular application implementation, and knowing full well that his actions — not his inaction — could have some very dire consequences not only on this particular event, but very possibly on the year-long project that was supposed to go into effect the next day. A senior application manager was there to assist, too, but between them and the quirky conditions, they were not equipped to handle the slowly degrading conditions — and then it became too late for any kind of recovery.
The result of the crash was an additional hour-long bridge call, discussing past events, how it got there, who did what, what should we do, and a call to a supporting vendor of the application technology. What was lost during that time was recovery. I made my opinion quick and clear — restart, evaluate, and continue. If something is remotely suspected as being wrong, fail-over to the standby system. There were very sound and technical reasons that only I could provide to NOT fail-over to the standby system first. And that suggestion was only going to be challenged by persons with less perspective into the underlying architecture. Even as I uttered the suggested course of action, I knew those details would only encourage an even longer discussion, and the resulting outcome would only remain the same. So I was not going to put up a prolonged fight for what I knew was best and least risky, because it was apparent I did not have the backing of my peers or senior management.
Unfortunately, this is not the first time these crisis events lead to involved, drawn-out conversations with the many stakeholders attempting to determine what the hell is going on with perfectly functioning systems. I can appreciate that process; but the audience is broad for the specialty this particular application and database set requires. Ironically, what compounds the equation and the time to digest it all are fueled by those same, few specialists.
Last year, I was appointed as “captain” to this team of specialists. I tried to dodge the responsibility, not because of my qualifications, and certainly not from a lack of confidence. And it is not like I cannot relate with that group, nor do I possess any lack of respect for their positions – so there are no interpersonal issues to speak of. I am just self-aware and experienced enough to know that each has a focused perspective into a crisis situation that naturally relates back to their position responsibilities. And being professional analysts, there is a natural compensation of logic supplied, even when there are facts outside their scope. It’s ironic that I worded this complex persona that way, because their programming language is purposely configured to allow out-of-scope variables to simply return a NULL result, allowing the aberration to continue without penalty. That precondition is a perfect prescription for imminent disaster.
So as I was joining the bridge call, I had already predetermined that my planned project would be aborted as an outcome of this event. Not a problem — my primary objective is to make their AAA production systems I administer as transparent a process to the organization as possible. I also knew that if I was allowed to simply restart the crashed system, it would fully recover, and at the very least we would have suffered minimal downtime. And by avoiding making the standby systems into the primaries, a probable bonus from that trivial effort would be no rebuilds and validations of those secondary systems — they would simply resume.
Can you guess that it did not happen that way? Was it the title of this blog post that gave it away? Yeah, its Wikipedia definition is spot on.
So, the discouraging, and even demoralizing, part is the current inequities of my position. I have never shied away from accountability for my actions. I have never shied away from taking action, in fear of consequence. If it is my responsibility to keep AAA production systems running, I should be allowed to act. If I act without soliciting appropriate resources, then there are consequences. There is no unclear line. Be prepared to answer for your line of thinking and the actions made as a result.
But with accountability is empowerment. You cannot have one without the other. And that is where I am finding myself without as of late. And it just feels terrible.
Did I solve the crisis? Yes. Did I restore normal operational order in a timely and efficient manner? Yes. I even received gratitude from a variety of persons, whereas I expect none should be given. It is why I am there — to provide and to be compensated for that level of expert services. But I know the costs from the lack of action. And I know that groups of varying qualities will always come to conclusions that meet the lowest common denomonator — and that yields mediocre results. How sad is that? How embarassing is it to be associated with that kind of process that could only yield those low returns?
In its aftermath, I have made my conclusions clear to my peers and superiors. We will have further discussions — change of this nature will not happen overnight. I am adding to my agenda to make this change happen, in a manner that is conducive to the channels and process open to me. It is my only recourse for success. And it is these kinds of challenges that defines the professional I vie, and strengthens my resolve. Even if the outcome is not what I expect now, I will make my attempt and accept its consequences. This crisis is squarely my own, mine to act upon, mine to see to completion. I like my chances already!