Guest Author Ben Kennedy shares more tales of his adventures as a Network Engineer. Watch for part two next week!
It's 2:13am and your cell phone is ringing. You're on call. Those precious few weeks have passed, and it's your turn again. You're not surprised to be getting a call though, because you did a move for a customer at midnight, just before heading to bed. You could have pushed the move off a couple more days and still fallen in line with standard turnaround times, but the customer's client relations manager used that magic word: "Please?" So how could you say "no" to that? Anyway, the customer probably just encountered a problem and needs to have the move reverted. Easy peasy.
"Hello?" Your voice is hoarse and barely audible.
"Hey, it’s Kevin. Sorry to bother you…" Kevin the NOC jock starts off the call with the typical late-night greeting. "There's something going on we need you to look into. We lost access to a bunch of switches. Looks like it could be a major."
You clear your throat to try and not sound too disoriented before answering. "Okay. I'll get on jabber." You sigh after pressing the end button on the cell phone.
The bed creaks as you get up. From the bed, your better half looks up groggily, silently asking where you're going. "Sorry, I gotta take a look at something," you grumble apologetically. From beneath the covers you hear, "I hate the NOC." The statement is barely done before the sound of even breathing fills the room once again. Both of you know this statement is untrue, but it's 2:14 in the morning and not the time to argue. Plus the clock is ticking.
You plod to the living room and turn on the computer monitor. You shield your eyes as the LCD flares to life. Squinting against the glare, you log in. Within moments you have a rundown of the situation: which devices are affected, the scope of customer impact, and a brief timeline. You roll your chair back and rub your eyes, trying to force your mind to speed up. You'll need all of your resources to figure this one out.
One of the problems is that what the customers see is just a symptom. The root cause could be almost anything. What you have to do is gather as much information as possible, as quickly as possible, and determine the problem that is causing those symptoms.
If you were a Network Engineer for an organization using a multi-tiered support system, things would be different. You would do an initial investigation and attempt to implement a solution. If you were unable to find a solution, you would escalate up to a tier 3 engineer while your clients watched the clock tick. A multi-tiered system can be extremely inefficient – not good when speed and flexibility are paramount. Of course, without additional tiers, you’re the last line of defense. You’re the one who solves all issues affecting the network. At times, it's a heavy burden to bear, but it’s one you bear with pride. With this last thought you snort derisively at those other lazy so-called "multi-tiered" Network Engineers, and get to work.
Time rolls on as you dive through various troubleshooting iterations. Each one turns up a dead end, but gives you more pieces to the puzzle. The question here is: how far do you go to fix the problem? Every time a customer calls in with a complaint you can't just go ahead and replace the whole network. The scale of the fix must match the scale of the problem, hence the importance of gathering a lot of quality data. Right now in your mind you've created five different scenarios that could have caused the symptoms that you're seeing. The tests for each have varying degrees of impact. Can you run them all at once? Of course not. Some theories require tests that can't be run simultaneously, while others would cause problems for customers who might not otherwise be impacted. You can’t justify jumping to drastic measures, such as replacing a major network device, without taking the time to test a few key links first. In the end you may have to replace that major network device, but you can't fully justify it until you've ensured that all other causes have been accounted for.
Throughout each theory, test, and implementation, you are mentally documenting your steps and timelines. You know that when all is said and done, tomorrow you'll need to sit down and dissect how you dealt with this situation, baring the process of your troubleshooting to a jury of your peers as well as superiors. You're held accountable for the decisions you make, even in the heat of the moment in the middle of the night. And that's just fine.
It's 3:03am and three of your five theories haven't panned out. You're getting into deeper water. Plus that clock is still ticking in your mind, and it's getting louder every second. You lean back for a moment to go over all the steps you've taken up until now before you jump into anything else. All your steps have been sound and logical. You've already replaced one network device. But it looks like it was a different device entirely that caused the failure in the other. A VERY rare occurrence, but when your kingdom consists of thousands of networking devices that span the globe, things are bound to fail. It's a fact you deal with every day.
You're at a point now where you have to act even though the next step will cause more serious customer impact. It's something that you avoid at all costs, but right now it's unavoidable. "Darn," you whisper out loud. If this were an episode of House, it would cut to commercial and come back with the situation stabilized and the team sitting around drinking coffee, discussing their options. Unfortunately it's just you here in your living room and you have to play this one out. The clock ticks on and the thought of money disappearing and trust being lost runs through your mind as you mentally calculate the sum of the customer impact. You shake your head and make the call to move forward. Members of other departments scramble to get the pieces in place. You get prepped to replace another device, which is no easy thing. Making the call is only half the battle.
It's 3:32am. The solution worked, and everything is back to normal. You log off jabber and disconnect from the conference call. As always, Kevin the NOC jock thanks you for helping out, even though you were just doing your job. You reply, "My pleasure," nonetheless. You try to calm your breathing and slow your heart. Once again your fight-or-flight mechanism has kicked in. The adrenaline gives you that burst of energy you need to make it through those stressful situations, but it also means you'll be jittery and restless for at least the next hour. After that you can head back to bed. No use waking anyone else up because you had to save the world, right? You smile to yourself at the thought. You pull out your phone and check your schedule for the day that starts in a few hours. You might be able to swing coming in a bit late to catch up on the lost sleep. You sigh when you see the reminder about that 9am meeting and 10:30am conference call.
Sitting in the dark, you start to go over the chronology of your ordeal, preparing it for the post-mortem with the team tomorrow. To Be Continued...