Intermittent Issues

Troubleshooting sporadic software issues. An investigation approach to understand and resolve intermittent problems in software.

by Ben Allums
May 3, 2011

Intermittent issues are always a challenge. What makes a problem occur at one time and not another, and, more importantly yet more confusingly, why? There does not seem to be an online guide that can instruct someone, for any product, on how to troubleshoot an issue that happens sporadically. My guess is that these happen mostly at the software level. (Don't worry, Engineering and Development. We still like you.) So, in my role, the process of investigation begins. In researching this subject, I realized that in the world of Software Support, in addition to taking a mechanic's role, we often take a bit of an investigative role.

I don't usually have a specific order of questions, but one could call it a 4W approach (minus the “Who”, because it is obviously happening to you). So, let's begin with “Where”. The accompanying questions are whether we can isolate where the issue occurs, and whether it occurs on just one system or on several.

Next we follow up with “When”. Inspecting the log becomes essential at this point. On the log screen you will see the various stages of output generation. Let's say that you run into issues during the Image pipeline. If your inconsistent errors are consistently happening at the Image pipeline, most likely something is going on with graphic generation, which leads into the “What” question. In our example we determined that the problem happens in the Image pipeline. So, what aspect of the graphics are we trying to isolate? Is it an issue with the Rasterizer? These are things that can be tested by using by-reference images in your input. Once you determine which part of the graphics is the culprit, you can begin to ask the final question, “Why”. The answer can be as simple as needing to close a dialog box. However, if you cannot figure out why the issue occurs, chances are you will not be able to resolve it. Sometimes you might get lucky and the issue goes away, but we want a better success rate than chance.
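To make the “Where” and “When” questions a little more concrete, here is a minimal sketch of how one might tally errors by system and by pipeline stage from a saved log. The log line format, the host names, and the `tally_errors` helper are all assumptions invented for illustration; they do not describe any particular product's log, so the pattern would need adjusting to whatever your log actually looks like.

```python
import re
from collections import Counter

# Hypothetical log line format (an assumption, not a real product's format):
#   2011-05-03 14:02:11 [host-a] [Image] ERROR Rasterizer timed out
LINE = re.compile(
    r"^(?P<ts>\S+ \S+) \[(?P<host>[^\]]+)\] \[(?P<stage>[^\]]+)\] "
    r"(?P<level>[A-Z]+) (?P<msg>.*)$"
)


def tally_errors(lines):
    """Count ERROR lines per (host, stage) so that clusters stand out."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line.strip())
        # Lines that do not fit the assumed format are skipped, not guessed at.
        if match and match.group("level") == "ERROR":
            counts[(match.group("host"), match.group("stage"))] += 1
    return counts


# Made-up sample data standing in for a real log file.
sample = [
    "2011-05-03 14:02:11 [host-a] [Image] ERROR Rasterizer timed out",
    "2011-05-03 14:02:12 [host-a] [Layout] INFO Page 1 composed",
    "2011-05-03 15:40:03 [host-a] [Image] ERROR Rasterizer timed out",
    "2011-05-03 16:01:47 [host-b] [Image] INFO Image embedded",
]

# Most frequent (host, stage) pairs first.
for (host, stage), n in tally_errors(sample).most_common():
    print(f"{host} {stage}: {n} error(s)")
```

With the made-up sample above, every error lands on one host and one stage, which is the kind of clustering that points toward a “What” worth testing. A tally like this only narrows the search; it does not explain the “Why”.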

Ultimately, the goal in Support is to reproduce and isolate issues. For system-specific issues, isolation is an absolute must. How can we fix something that we know nothing about? If you have submitted a case and it seems like we are asking many questions, there is a reason: we are trying to help you as quickly as possible. Also, feel free to share any troubleshooting insights that you have.
