Death by a Thousand Little Bugs

Minor product defects that take only a few minutes to resolve are often never fixed; it seems there are always more important tasks to work on. If this sounds familiar, your test team may suffer from morale issues. And your product may suffer from “death by a thousand little bugs”. Fortunately, these problems can be fixed as easily as these bugs can.

Once testers get their hands on a feature, it doesn’t take long for low-priority defects to pile up in the bug-tracking database. These may include, for example, minor UI issues such as missing punctuation, inconsistent fonts, or grammar errors. These bugs tend to pile up because they are primarily cosmetic. Teams resolve the highest-priority bugs first, and often rightly so. We should fix bugs that greatly affect functionality, performance, or security before fixing a typo in the UI.

The risk, however, is that many of these low-priority bugs never get fixed. More critical defects keep being discovered, so we continually postpone the low-priority ones.

Unfortunately, some of the bugs left behind are those that were logged the earliest. There are few things I find more frustrating than reporting a simple bug that doesn’t get fixed. My typical complaint sounds something like this: “Why hasn’t this bug been fixed? I logged it weeks ago. It’s a one-line change that will take only two minutes to fix!”

A previous project I worked on provides a perfect example. Not long after I was given the first working build of the UI, I logged two minor bugs. One issue was logged because two buttons on the same page were not aligned properly. The other bug was simply that a sentence ended with an extra period. When the product was released more than four months later, the misaligned buttons and the extra period were still there.

Another problem is that even if these low-impact bugs don’t affect functionality, they can greatly affect the customer’s perception of the product. How can a customer fully trust a product, no matter how well it actually works, if there are mountains of minor defects? This is the “death by a thousand little bugs” syndrome.

Before I came to Microsoft, I ran an online store. One night I modified the shopping cart page, and the next day sales plummeted. When I reviewed the changes I had made, I realized that I had misspelled two words and added a broken image link. I fixed these issues and sales quickly went back to normal.

The functionality of the page hadn’t changed at all. But potential customers saw the “minor” errors and assumed the entire shopping cart had poor quality. They certainly didn’t rationalize, “They must have spent all their effort making sure the functionality was solid. That’s why they postponed these obvious, but low-priority bugs.”

The “death by a thousand little bugs” syndrome exists because most teams evaluate each bug individually–and individually, each of these bugs is trivial; but in the aggregate, they are not. Collectively, they make users skeptical of your product.

The solution is that we shouldn’t always address high-priority bugs before low-priority bugs. But when do we make the exceptions? Here are three strategies that I think could help solve these problems.

  1. Set aside one day each month for developers to address the low-priority, low-hanging-fruit bugs. This is a great way to fix a lot of bugs in a short amount of time. It can also prevent your product from suffering from “death by a thousand little bugs.”
  2. Put aside one day every month to fix the defects that have been in the bug database the longest–regardless of priority. This helps prevent testers from becoming demoralized because bugs they logged months ago still haven’t been fixed.
  3. Once a month, increase the priority of all bugs that are at least 30 days old. Developers can continue to pull bugs out of the queue in priority order, but the difference is that after one month, a bug that was logged as P4 (lowest priority) becomes a P3. After three months, it becomes a high-priority P1 bug. It may initially sound odd that low-priority defects, such as a misspelled word in a log file, will eventually be classified as highest priority. But doing so forces some action to be taken on the bug. As a P1, it now must either be fixed or closed by the Program Manager as “Won’t Fix”.
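
The third strategy is mechanical enough to sketch in code. The Python below is only an illustration under stated assumptions: the `Bug` class, its field names, and the constants are hypothetical, and instead of running a monthly batch update it derives the current priority from the bug's age, one step per full 30 days, stopping at P1.

```python
from dataclasses import dataclass
from datetime import date

HIGHEST_PRIORITY = 1   # P1; assumption: lower number means more urgent
ESCALATION_DAYS = 30   # assumption: one priority step per full 30 days of age


@dataclass
class Bug:
    """Hypothetical bug record; not tied to any real bug tracker."""
    title: str
    logged_priority: int   # priority assigned when logged, 1 (P1) to 4 (P4)
    logged_on: date


def effective_priority(bug: Bug, today: date) -> int:
    """Return the bug's priority after age-based escalation."""
    age_days = max((today - bug.logged_on).days, 0)
    steps = age_days // ESCALATION_DAYS
    # Raise the priority one level per step, but never beyond P1.
    return max(HIGHEST_PRIORITY, bug.logged_priority - steps)


def needs_decision(bug: Bug, today: date) -> bool:
    """True once the bug has reached P1 and must be fixed or closed."""
    return effective_priority(bug, today) == HIGHEST_PRIORITY


if __name__ == "__main__":
    bug = Bug("Sentence ends with an extra period", 4, date(2024, 1, 1))
    for day in (date(2024, 1, 20), date(2024, 2, 5), date(2024, 4, 5)):
        print(day, "P%d" % effective_priority(bug, day))
```

Computing the priority from the logged date, rather than overwriting it each month, keeps the original triage decision visible next to the escalated value. A team could just as well run a monthly job that edits the priority field directly.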

You may be thinking, “but I’m a tester, and these solutions have nothing to do with testers.” When I started in Test, that’s how I thought. I now realize that my primary responsibility is to make my product better, not just to log bugs. If these strategies would work well for your team, then you should lobby for them–they may even increase your own morale along the way.

Do you think any of these strategies would work well for your team? What strategies have you tried in the past, and how have they worked? I’m very interested in hearing your comments.

8 Responses

  1. Death by a thousand little bugs can certainly be painful. But I’m not sure if I could agree less with the “strategies” you propose. They seem to be more tactical bandaids that could cause further problems.

    For example for #1 and #2 above, setting a day a month to fix these bugs could be hugely risky. This would quickly increase the amount of technical debt for a team and could actually promote a behavior of “Meh, I’ll fix this little issue later on ‘little bug fix day’.” And if the team is already pushing these bugs due to other issues coming up, who’s to say that the team will actually stick to scheduling a day and fixing those bugs? There’s also plenty of studies showing bug clusters due to code churn with increasing chances of regression. If a team already has a problem racking up these little bugs, do you think that same team will do a ‘good job’ actually fixing these without destabilization?

    Increasing the priority of older bugs seems wrong wrong wrong. If a bug keeps getting postponed because it’s not really important, then face the facts – it ain’t getting fixed. And does it really need to?

    Which leads to the real questions – why is the team behaving in such a manner where they want to keep increasing their technical [and mental] debt like this? Why hasn’t the team read The Pragmatic Programmer? Why is code being checked in with all of these “little bugs” to begin with?

    Notice I’m using the word “team” above. The team needs to establish the right culture of quality and rules behind what bugs are really important with the proper criticality (severity, priority). How can the culture be improved so that those who are checking in the code can see these “little bugs” before they are committed? This is usually indicative of teams who build silos between dev and test; dev checks in code and throws it over the fence for test to pick up later. This isn’t a dev problem or test problem, this is a team problem.

    Lastly, if there are so many of these little one liners piling up, is there an opportunity for the tester to just fix the issue herself? If the team is truly letting these issues pile up and have more critical bugs to fix, then I would imagine that the devs would be happy to have testers hop on and fix some of these issues on their behalf. Or file better bugs – if bugs have detailed info and suggestions on how to fix (or even a shelveset/bbpack/pullrequest with the changes done), then it should be trivial for the dev to get this checked in, or for them to tell you to just check it in yourself. And if you’re on a team where testers aren’t allowed to check in product code, then you’ve got bigger problems (and it’s time to update your resume).

  2. @nitdoggx – I agree with your main point that, overall these can be looked at as a “team problem”. However I don’t agree with what seems to be an underlying assumption that there’s a “one size fits all” best approach here. Sometimes the test team IS kept very separate to the development project for various reasons. Independence may be a key factor. They may be actually totally different organisations or even different companies. Configuration management could become a complete mess on huge global team projects if all testers had access to check in product code.

    I think you need to re-consider that there’s many different approaches that could be useful depending on the situation, as the original article does.

    • @Rik I’m not making a “one size fits all” assumption at all. These are very tricky issues that could have multiple solutions or no solutions at all. What I’m trying to say is that analysis should be done to understand contributing factors and go from there. Ask “how did we get here and why?” For some organizations where QA is well integrated, then there *may* be easier solutions for establishing better processes and culture to reduce the compounding tech debt. For organizations where QA is kept very separate, then of course it will be more challenging and may require different courses of action. As you mention, it totally depends on the situation. I’m just not sure if the strategies proposed in the article would help the team get back on track. And I’m certainly not saying that my proposals will help, but they may. And they have worked well for my various teams in the past. This is all IMHO, YMMV.

      • Thanks for the follow-up Nithin. 🙂 In that case I completely agree with you, there are likely to be much better solutions at the full-team level than those suggested in the article – I guess you voted “none would work” then; I can see that…

        However I’m honestly not worried about my job and checking my resumé purely because I’m not allowed to check-in product code! 😉

  3. @nitdoggx — Thanks for the feedback. Just one comment.

    You wrote, “Increasing the priority of older bugs seems wrong wrong wrong. If a bug keeps getting postponed because it’s not really important, then face the facts – it ain’t getting fixed. And does it really need to?”

    You’re right, it does initially seem odd to raise the priority of older bugs. And no, these bugs don’t necessarily need to be fixed. But raising the priority forces SOME action to be taken on the bug. That action may be to close the bug without fixing it. But at least the bug has been re-considered and addressed while there is still time to fix it. When a bug stays as a P4, it’s often never looked at again until just before the project is released. And at that time it’s too late to risk the churn.

  4. Assuming they are categorized correctly, low-priority bugs really are low priority compared to high-priority bugs. So why would we want to redirect people to work on lower priority problems? I think the answer is that a group of low priority issues, taken collectively, could amount to a high priority issue. While one typo may not have a big effect, a collection of many typos can have a dramatic effect on the user experience. So it makes sense to take some action to look for trends and problems in collections of bugs as well as in individual bugs. The proposals made here are project-management solutions to getting the engineers to look at these kinds of problems. They don’t really address the engineering questions behind how the bugs got there and what should be done about them.

    A system that arbitrarily increases priority based on age or allocates a fixed percentage of engineering budget to work on less important problems seems to be designed to do something rather than thinking through the right thing to do. Still it can act as check to make sure something gets done instead of ignoring a possible problem. I suspect that different development teams can come up with something that will work better for their own situations.

    Another one I’ve seen used is the idea of “bug jail”. When one programmer (or one component or feature or whatever) has more bugs than a certain threshold, development must stop until the number of bugs is back below the threshold. Like the others, this doesn’t ensure the right thing happens, but it does put bounds on how bad the problem can get before something happens.

  5. @Ralph Case, @nitdoggx

    You guys are both right that these strategies don’t address the engineering questions behind how these bugs got there in the first place. I don’t know how to prevent all these little bugs from happening–I am not sure anyone does. That’s a much more difficult problem to solve.

    But just because we don’t have a vaccine to prevent the disease, it doesn’t mean that we shouldn’t look for a cure. The solutions I am suggesting are all very simple, and any team can implement them today. They don’t require changing the culture on your team, which is difficult to do, and takes a long time. They don’t require changing the ratio of developers to testers. They are simple strategies that will help cure — not prevent — the problem, and take almost zero effort to implement.

    Ralph, I like the idea of “Bug Jail” that you described. This is very similar to the strategies that I proposed, in that they force you to do *something* with the low-priority bugs before it’s too late. I think it would work just fine.

  6. Nice topic Andrew.
    In every company I have worked in, from very large… to start-ups, this problem always exists. Teams will naturally work on the highest priority/severity defects, and if they have time will try to get to the low level bugs. Never happens, because the next big bug is right around the corner. Good project management will go after the low hanging fruit periodically; it’s also a great morale boost to knock out lots of small problems in the length of time for one big one.

    People… I am 100% in agreement with Andrew, that low level bugs can be as disastrous as major bugs.. maybe more so in some cases. His example of the Web site is a perfect example, I can put up with slow response, maybe occasional crashes or hangs, but if I see poorly worded text, misaligned buttons, etc etc.. I’m out of there in a second..

    I think as eng we tend to think about the tech inside, and fail to realize that it’s the surface stuff we interact with 99% of the time. I’m sure you’ve heard that “first impressions are the most important”? Well, in this case the UI is the “image” of the company. It is the first thing users see, and they see it often. So a typo, etc is constantly staring them in the face. A loose analogy would be: you buy a nice flat screen TV and the black piano finish around the edges is slightly mismatched but you can see it when light reflects off of it, and some of the buttons on the front are misaligned, and the on/off LED is off center a little. However inside, only the best design and components are used. Most people don’t really care about what’s inside or how well the core SW is written, if the thing they see the most is badly done. It reflects poorly on the company, and people will naturally assume “if they don’t care about the obvious, then what about the really important stuff?”

    That said…. brace yourselves…. the fault mainly lies with QA. Yes “TQM” and “Quality is job one” are nice, but often it is QA that needs to stand up for the user and protect the best interests of the company. I always look at our role and how we are presenting/managing the important Quality aspects of our product. To be effective, QA must create defect classifications that take the corporate strategy and how often a minor issue is encountered into account when classifying defects. I could write an article just on this, but basically I use two main categories for defect classifications, Severity and Priority. Severity is how bad it is when the defect happens. The second is Priority, which is adjusted by the team based on many factors… but one would be how often the defect is encountered. Priority is the determining factor used by the team for the order defects should be fixed in. With this method consider two defect scenarios: #1 A crash (no loss of data/life etc) that happens rarely would have a Severity of “Very High”, but the Priority is “Low” because it occurs very infrequently and is difficult to reproduce. #2 On the other end of the spectrum is a defect for misspelling; the Severity is Low, but the Priority is High because it is seen every day by millions of people. In these two examples, the dev team would fix #2 before #1. (Note: a third category called “Risk”, which uses a weighted multiplier for Severity / Priority, could be used to determine the overall importance of the defect.)
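
A rough sketch of the “Risk” idea from the note above, in Python. The comment does not say how the multiplier is weighted, so every number here is an assumption chosen for illustration, as are the level names and the `risk` function itself. The weights are picked so that frequency of exposure (Priority) counts for more than Severity, which matches the ordering in the two scenarios.

```python
# Hypothetical weights; a real team would tune these to its own strategy.
SEVERITY_WEIGHT = {"low": 1, "medium": 2, "high": 3, "very high": 4}
PRIORITY_WEIGHT = {"low": 1, "medium": 3, "high": 6}


def risk(severity: str, priority: str) -> int:
    """Combine Severity and Priority into a single score (higher = fix sooner)."""
    return SEVERITY_WEIGHT[severity.lower()] * PRIORITY_WEIGHT[priority.lower()]


if __name__ == "__main__":
    rare_crash = risk("Very High", "Low")    # scenario #1
    visible_typo = risk("Low", "High")       # scenario #2
    print("rare crash:", rare_crash)
    print("visible typo:", visible_typo)
```

With these particular weights the widely seen typo scores above the rare crash, so it would be fixed first. Different weights would give a different ordering, which is the point of agreeing on them as a team.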

    In closing, the team is naturally set up to work the defects in order of importance, but you must be prepared to argue the merits of raising the importance of a defect, and be prepared for rough waters at first. A very effective approach to support your argument, that is true for most QA activities, is also the hardest to pull off: “Find out what is the most important to the company or team at the moment” and classify the low hanging defects according to those criteria.