Why NOT to fix a bug

We testers love to have our issues/bugs fixed, especially Sev 1 (i.e. crashing or data loss) ones. But sometimes we love it when they DON’T fix a bug. Say what? Yes, I once fought to NOT fix a crashing bug. But I’m getting ahead of myself.

Whenever we find a bug, we assign a number to it denoting how serious it is. Maybe it’s a trivial issue the customer would likely never notice. Maybe it’s a must-fix bug such as a crash, data loss, or a security vulnerability. At Microsoft, we generally assign every bug two numbers when we enter it: Severity and Priority. Severity is how bad the bug is: a crash = 1, a button border color off by a shade = 4. Priority is how soon the bug should be fixed: search does nothing so I can’t test my feature = 1, searching for ESC-aped text doesn’t work = 4.
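To make the distinction concrete, here is a minimal sketch of how those two numbers might sit on a bug record. This is not our actual bug-tracking schema, just an illustration in Python using the examples above (the severity I give the search bug below is invented):

    from dataclasses import dataclass

    @dataclass
    class Bug:
        title: str
        severity: int  # how bad it is: 1 = crash/data loss ... 4 = cosmetic nit
        priority: int  # how soon to fix it: 1 = fix now ... 4 = can wait

    # A cosmetic bug can rank low on both scales, while a bug that blocks
    # testing gets top priority even though the product still runs:
    border_nit = Bug("Button border color off by a shade", severity=4, priority=4)
    search_dead = Bug("Search does nothing; I can't test my feature", severity=2, priority=1)
    print(border_nit, search_dead, sep="\n")

The point is simply that the two numbers are independent: how bad a bug is and how urgently it needs fixing are separate judgments.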

Once we enter a bug, it’s off to Bug Triage. Bug Triage is a committee made up of representatives from most of the disciplines. At the start of a project, there is a good chance all bugs will be fixed. We know from mining our engineering-process data, though, that whenever a bug is fixed there is a non-zero chance the fix won’t be perfect or that something else will be broken. Early in the project, we have time to find those new bugs. As we get closer to release, there may not be time to find the few cases where a fix broke the code.

One more piece to this puzzle: Quality Essentials (QE). It is a list of the practices and procedures – the requirements – that our software or service must meet in order to be released. It could be as simple as verifying the service can be successfully deployed AND rolled back. It could be as mundane as zeroing out the unused portions of sectors on the install disk.

Now, that bug I told you about at the beginning. We have an internal web site that lets employees search for and register for trainings. We had a sprint, a four-week release cycle, at the end of the year in which we had to make the site fully accessible to those with disabilities. This was a new QE requirement. We were on track to ship on time…as long as we skipped our planned holiday vacations. While poking around the site over lunch one day, I noticed that we had a SQL injection bug: I could crash the SQL backend. The developer looked at the bug, and the fix was fairly straightforward (a sketch of that kind of fix follows the list of options below). The regression testing it would require, though, would take a couple of days, and that time was not in the schedule. Our options were:
• Reset the sprint, fix the new bug, and ship late. We HAD to release the accessibility work by the end of the year, so this wasn’t an option.
• Bring in more testing resources. With the holiday vacations already underway, this wasn’t a realistic option.
• Take the fix, do limited testing, and be ready to roll back if problems were found. Since this site has to be up 99.999% of the time, this wasn’t a legitimate option.
• Not fix the bug. This is the option we decided to go with.
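Before getting to the reasons, a quick aside on what a fix like that usually looks like. Our site wasn’t written in Python, and I’m not showing the developer’s actual change; this is just a generic sketch, with made-up table and column names, of replacing string-built SQL with a parameterized query:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trainings (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("INSERT INTO trainings (title) VALUES ('Accessibility 101')")

    def search_unsafe(term):
        # Vulnerable: the search term is pasted straight into the SQL text,
        # so a crafted term can change the query itself (or crash the backend).
        query = "SELECT id, title FROM trainings WHERE title LIKE '%" + term + "%'"
        return conn.execute(query).fetchall()

    def search_safe(term):
        # Fixed: the term is passed as a bound parameter, so the database
        # treats it purely as data, never as SQL.
        query = "SELECT id, title FROM trainings WHERE title LIKE ?"
        return conn.execute(query, ("%" + term + "%",)).fetchall()

    print(search_safe("Accessibility"))  # [(1, 'Accessibility 101')]

The code change itself is small; it was the regression pass around it that we couldn’t afford in that sprint.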

Why did we go with the last option? There were a few reasons:
1) The accessibility fix HAD to be released before the end of the year due to a Quality Essentials requirement.
2) The SQL backend was behind a load balancer, with a second server and one standby. One SQL server was usually enough to handle the traffic.
3) A crashed SQL server would be automatically rebooted and would rejoin the load balancer within a minute or two, so end users were unlikely to notice any performance issues.
4) The web site is internal only, and we expect most employees to be well behaved…the project tester, me, being the exception.

So: the likelihood of a crash was small, and the impact of a crash was small, so we shipped it. After a few days off, the next sprint, a short one, was carried out just to fix and regression-test this one bug. According to the server logs, the SQL server was crashed once between the holidays and the release of the fix. It was noted by our ever-diligent Operations team. But, hey, I was testing the logging and reporting system. 🙂

I would be remiss if I didn’t add that each bug is different and must be examined as part of the whole system. The fix decision would have been very different if this were an external-facing service, or if something critical such as financial data were involved.
