Everyone has to do troubleshooting, whether they are housewives or mechanical engineers. Troubleshooting is a process and can generally be learned, although some people have a natural ability that borders on the psychic.
The process of troubleshooting takes you from the known to the unknown. Whether you are solving a math problem or figuring out why a space probe is malfunctioning over a million miles away, the first step is to make a list, either an actual list or a mental one. Take the classic problem: where are my car keys? What do you know?
- I had my keys last night when I got home
- I don’t have my keys now
- I checked my key bowl, they weren’t there
Next, you need to determine what has changed in the environment. Sometimes leaving the familiar for a while and then returning lets you see things that “blend in” because they are so familiar. This is similar to a writer who cannot see the misspellings or missing words until putting the article away and looking at it again the next day. Again, making a list, either on paper or in your mind, helps you crystallize the changes.
- The table has several newspaper sections on it
- I was very tired; my pants are on the floor from last night
First, check the differences: in the case above, look through the papers on the table and check the pants pockets. In this case it is a no go. Let’s move on to round two and think again: what was different from normal?
- I was wearing a jacket because of the recent cold snap
OK, check the jacket pockets; again, no go. So we have determined that the differences we noticed in the environment were not a factor. Next, we should retrace the actions up to the loss of the keys. The last time I had them was at the car when I got out last night. So I went down to the car and looked in the window (the doors were locked): no keys in the ignition. I followed the path up to the house and, voilà! The keys were still in the door… like I said, I was tired. Oops, I was also carrying a bunch of stuff, something I should have put on the list of changes in the environment.
Perhaps this was a simplistic example; the problem becomes harder when you deal with complex systems. Complex systems have power cords, connections and internal subcomponents; think about the computer you use. A computer is an electrical device with a cord and plug, and it may have multiple external components (printer, monitor, keyboard, mouse, network connection) as well as internal components (video cards, Ethernet cards called NICs, and maybe Fibre Channel cards called HBAs). In a complex system the steps are the same:
- Define what you know
- Determine changes since conditions were stable
- Isolate each change (back it out) to see if it caused the failure or problem
Step 1 in an electrical system also involves making sure things are as they should be. By this I mean:
- Check the power supply: is the system plugged in (at both ends), and is the power supply working?
- Are all the cables tight/plugged in properly?
- Are all the internal cards seated properly with good contact?
You would be surprised how many problems are solved at step 1! Oh, also add: is the on/off switch in the on position (are all the switches in the right position)?
Step 2 usually requires talking to the user of the system (if it isn’t you!), and, as shown on the TV show House (about a cantankerous doctor who uses differential diagnosis to cure patients), everyone lies! People will insist that nothing has changed, but if you keep pushing, they will usually admit that they added a new program, changed a NIC or HBA, or monkeyed with settings (switches or software).
Step 3 applies if something was indeed changed. In that case, get the system back as close as possible to the configuration in which it worked, then apply the changes one at a time to determine which one caused the problem.
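For a computer, step 3 can be pictured as a small loop. The Python sketch below is only an illustration under stated assumptions: `find_breaking_change` and the `works` test are hypothetical names, each change is just a label you can re-apply, and `works` stands in for however you actually check the system (booting it, printing a test page, and so on). It also assumes the changes do not depend on one another and that a single change is to blame.

```python
from typing import Callable, Optional, Sequence


def find_breaking_change(
    changes: Sequence[str],
    works: Callable[[list[str]], bool],
) -> Optional[str]:
    """Re-apply changes to a known-good baseline, one at a time.

    Hypothetical helper: `works` is assumed to test the system with the
    given list of changes applied. Returns the first change after which
    the test fails, or None if everything still works.
    """
    applied: list[str] = []  # start from the last configuration that worked
    if not works(applied):
        raise RuntimeError("baseline fails; restore the working setup first")
    for change in changes:
        applied.append(change)  # apply exactly one more change
        if not works(applied):  # test before touching anything else
            return change
    return None
```

For example, `find_breaking_change(["new program", "swapped NIC", "changed settings"], lambda cfg: "swapped NIC" not in cfg)` returns `"swapped NIC"`, because that is the first change after which the test fails. Testing after every single change is the whole point: if you apply two changes at once and the system breaks, you still do not know which one did it.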
Now, what about when nothing has changed? Then you find out as much as possible about what was going on when the failure occurred; however, sometimes you may have no one to ask. In the case of computers, there are system logs that can be reviewed (if the system is up, just not working right). If needed, the system disk drive can be attached to another working system so the logs can be reviewed. Remember to set it as a non-boot drive or you may bring down the other system, and scan it immediately for viruses, worms or Trojan horses.
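Logs can be long, so a first pass often just pulls out the lines that look like trouble. The sketch below is a hypothetical helper, not a standard tool: the keyword list is an assumption, and real logs use their own formats and severity words, so you would adjust the pattern to the system you are looking at.

```python
import re
from typing import Iterable, Iterator

# Assumed keywords; real logs vary, so treat this pattern as a starting point.
TROUBLE = re.compile(r"\b(errors?|fail(ed|ure)?|critical|warn(ing)?)\b", re.IGNORECASE)


def suspicious_lines(lines: Iterable[str]) -> Iterator[tuple[int, str]]:
    """Yield (line number, text) for each log line matching TROUBLE."""
    for number, line in enumerate(lines, start=1):
        if TROUBLE.search(line):
            yield number, line.rstrip("\n")
```

Because it accepts any iterable of lines, you can pass an open log file directly, for example `for number, text in suspicious_lines(open(path, errors="replace")): print(number, text)`, where `path` points at a log copied from the suspect drive. Keeping the line numbers lets you jump back to the full log and read what happened just before each error.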
We have discussed two situations: one, lost keys, can be fairly simple to solve; the other, a malfunctioning computer, can be complex. Just remember to break the problem down: determine what you know, determine what has changed, and determine which change caused the problem.