Why NOT to fix a bug

We testers love to have our issues/bugs fixed, especially Sev 1 bugs (i.e. crashes or data loss). Yet sometimes we love it when they DON’T fix a bug. Say what? Yes, I once fought to NOT fix a crashing bug. But I’m getting ahead of myself.

Whenever we find a bug, we assign it a number denoting its severity. Maybe it’s a trivial issue the customer would likely never notice. Maybe it’s a must-fix bug such as a crash, data loss, or security vulnerability. At Microsoft, we generally assign each bug two numbers when we enter it: Severity and Priority. Severity is how bad the bug is: a crash = 1, a button border color off by a shade = 4. Priority is how soon the bug should be fixed: search does nothing so I can’t test my feature = 1, searching for ESC-aped text doesn’t work = 4.

Once we enter a bug, it’s off to Bug Triage. Bug Triage is a committee made up of representatives from most of the disciplines. At the start of a project, there is a good chance all bugs will be fixed. We know, though, from mining our engineering process data, that whenever a bug is fixed, there is a non-zero chance the fix won’t be perfect or something else will break. Early in the project, we have time to find those new bugs. As we get closer to release, there may not be time to find the few cases where we broke the code.

One more piece to this puzzle: Quality Essentials (QE). It is the list of practices and procedures – the requirements – that our software or service must meet in order to be released. It could be as simple as verifying the service can be successfully deployed AND rolled back. It could be as mundane as zeroing out the unused portions of sectors on the install disk.

Now, that bug I told you about at the beginning. We have an internal web site that allows employees to search for and register for trainings. We had a sprint, a four-week release cycle, at the end of the year in which we had to make the site fully accessible to those with disabilities. This was a new QE requirement. We were on track to ship on time…as long as we skipped our planned holiday vacations. While messing around with the site over lunch one day, I noticed that we had a SQL injection bug: I could crash the SQL backend. The developer looked at the bug and the fix was fairly straightforward. The regression testing required, though, would take a couple of days. That time was not in the schedule. Our options were:
• Reset the sprint, fix the new bug, and ship late. We HAD to release the fix by the end of the year, so this wasn’t an option.
• Bring in more testing resources. With holiday vacations already underway, this wasn’t a good option either.
• Take the fix, do limited testing, and be ready to roll back if problems were found. Since this site has to be up 99.999% of the time, this wasn’t a legitimate option.
• Not fix the bug. This is the option we decided to go with.

Why did we go with the last option? There were a couple of reasons:
1) The accessibility fix HAD to be released before the end of the year due to a Quality Essentials requirement.
2) The SQL backend was behind a load balancer, with a second server and one standby. One SQL server was usually enough to handle the traffic.
3) The crashed SQL server was automatically rebooted and rejoined the load balancer within a minute or two, so the end user was unlikely to notice any performance issues.
4) The web site is internal only, and we expect most employees to be well behaved…the project tester, me, being the exception.

The likelihood of a crash was small and its impact minor, so we shipped it. After a few days off, the next sprint, a short one, was dedicated to fixing and regression-testing this one bug. According to the server logs, the SQL server was crashed once between the holidays and the release of the fix. It was noted by our ever-diligent Operations team. But, hey, I was testing the logging and reporting system. 🙂

I would be remiss if I didn’t add that each bug is different and must be examined as part of the whole system. The fix decision would have been very different if this were an external-facing service, or if something critical, such as financial data, were involved.

A quick coverage of Code Coverage

Testing is full of numbers:

  • How long will the test pass take?
  • What percentage of the features have you tested?
  • What is the automation test pass rate?
  • How confident are we that the failing tests are real product failures and not failures of the test system?
  • What is my raise going to be?

Code Coverage is just a number.  It tells us how much of the code has been exercised, and maybe verified, by our testing effort.  This is also sometimes called White Box testing since we look at the code in order to develop our test cases.  Management sometimes puts a high value on the code coverage number.  Whether they should or not is a discussion best left to each company.  There are multiple ways we can get code coverage numbers.  Here are three examples.

Block testing

Definition: Execute each contiguous block of code at least once

Block testing is the simplest first-order method for obtaining a code coverage number.  Its strength is that it’s quick.  Its weakness is that it’s not necessarily accurate.  Take a look at this code example:

bool IsInvalidTriangle(ushort a, ushort b, ushort c)
{
    bool isInvalid = false;
    if ((a + b <= c) || (b + c <= a) || (a + c <= b))
    {
        isInvalid = true;
    }
    return isInvalid;
}

If we tested it with the values a=1, b=2, and c=3, we would get code coverage of about 80%.  Great, management says, SHIP IT!  Wait, you say, there is a weakness in block-level testing.  Can you spot it?  Our one test case only exercises the first condition of the IF statement.  Block-level testing will report the line as 100% covered, even though we did not verify the second and third conditions.  If one of those expressions were “<” instead of “<=”, we would never catch the bug.
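To make the gap concrete, here is a rough Python port of the example (a sketch for illustration; the original is C#), along with the extra cases that condition-level testing would demand:

```python
def is_invalid_triangle(a, b, c):
    # Python port of the C# example above
    is_invalid = False
    if (a + b <= c) or (b + c <= a) or (a + c <= b):
        is_invalid = True
    return is_invalid

# This single test executes every block ("100% block coverage")...
assert is_invalid_triangle(1, 2, 3) is True   # only the first condition decides

# ...yet says nothing about the other two conditions.  Each of these
# forces a different sub-expression to decide the result:
assert is_invalid_triangle(3, 1, 2) is True   # b + c <= a
assert is_invalid_triangle(1, 3, 2) is True   # a + c <= b
assert is_invalid_triangle(3, 4, 5) is False  # a valid triangle
```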

Condition testing

Definition: Make every sub-expression of a predicate statement evaluate to true and false at least once

This is one step better than block-level testing since we validate each condition in a multiple-condition statement.  The trick is to break any statement with multiple conditions into one condition per line, and then put a letter in front of each condition.  Here is an example:

void check_grammar_if_needed(const Buffer& buffer)
{
A:  if (enabled &&
B:      (buffer.cursor.line < 10) &&
C:      !buffer.is_read_only)
    {
        grammarcheck(buffer);
    }  
}

Our tests would be:

Test   enabled   value of ‘line’   is_read_only   Comment
1      False     N/A               N/A
2      True      11                N/A            A is now covered
3      True      9                 True           B is now covered
4      True      9                 False          C is now covered

Breaking the conditions into one per line doesn’t really help much here.  The trick pays off when you have nested conditions.  You can set up a table to help make sure each inner expression condition is tested with each outer expression condition.
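As a sketch, the four tests in the table can be written directly as automation (Python stand-ins; `Buffer`, `grammarcheck`, and passing `enabled` as a parameter are assumptions for illustration):

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the C++ example above.
@dataclass
class Cursor:
    line: int

@dataclass
class Buffer:
    cursor: Cursor
    is_read_only: bool

checked = []

def grammarcheck(buffer):
    checked.append(buffer)

def check_grammar_if_needed(enabled, buffer):
    if (enabled and                      # A
            buffer.cursor.line < 10 and  # B
            not buffer.is_read_only):    # C
        grammarcheck(buffer)

# The four tests from the table: each sub-expression decides the outcome once.
check_grammar_if_needed(False, Buffer(Cursor(0), False))   # test 1: A false
check_grammar_if_needed(True,  Buffer(Cursor(11), False))  # test 2: B false
check_grammar_if_needed(True,  Buffer(Cursor(9), True))    # test 3: C false
check_grammar_if_needed(True,  Buffer(Cursor(9), False))   # test 4: all true
assert len(checked) == 1  # only test 4 reaches grammarcheck
```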

Basis Path testing

Definition: Test C different entry-exit paths where C (Cyclomatic complexity) = number of conditional expressions + 1

Does the term “Cyclomatic complexity” bring back nightmares of college?  Most methods have one entry and one or two exits.  Basis Path testing is best applied when there are multiple exit points, since you look at each exit path in order to determine your code coverage.  The steps to find the basis paths (shortest-path method) are:

  • Find shortest path from entry to exit
  • Return to algorithm entry point
  • Change next conditional expression or sub-expression to alternate outcome
  • Follow shortest path to exit point
  • Repeat until all basis paths defined

Here is an example:

A:  static int GetMaxDay(int month, int year)
    {
        int maxDay = 0;
B:      if (IsValidDate(month, 1, year))
        {
C:          if (IsThirtyOneDayMonth(month))
            {
                maxDay = 31;
            }
D:          else if (IsThirtyDayMonth(month))
            {
                maxDay = 30;
            }
            else
            {
                maxDay = 28;
E:              if (IsLeapYear(year))
                {
                    maxDay = 29;
                }
            }
        }
        return maxDay;
F:  }

Test cases:

Branch to flip   Shortest path out                       Path     Input
n/a              B==false                                ABF      0, 0
B                B==true, C==true                        ABCF     1, 1980
C                B==true, C==false, D==true              ABCDF    4, 1980
D                B==true, C==false, D==false, E==false   ABCDEF   2, 1981
E                B==true, C==false, D==false, E==true    ABCDEF   2, 1980
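A rough Python port of `GetMaxDay` lets us execute the five basis-path tests from the table (the helper functions are assumptions, since the original code doesn’t show them):

```python
# Hypothetical helpers matching the names in the C# example.
def is_valid_date(month, day, year):
    return 1 <= month <= 12 and year > 0

def is_thirty_one_day_month(month):
    return month in {1, 3, 5, 7, 8, 10, 12}

def is_thirty_day_month(month):
    return month in {4, 6, 9, 11}

def is_leap_year(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def get_max_day(month, year):
    max_day = 0
    if is_valid_date(month, 1, year):          # B
        if is_thirty_one_day_month(month):     # C
            max_day = 31
        elif is_thirty_day_month(month):       # D
            max_day = 30
        else:
            max_day = 28
            if is_leap_year(year):             # E
                max_day = 29
    return max_day                             # F

# The five basis-path tests from the table:
assert get_max_day(0, 0) == 0        # ABF
assert get_max_day(1, 1980) == 31    # ABCF
assert get_max_day(4, 1980) == 30    # ABCDF
assert get_max_day(2, 1981) == 28    # ABCDEF, E false
assert get_max_day(2, 1980) == 29    # ABCDEF, E true
```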

These are just three of the many different ways to calculate code coverage.  You can find these, and more, detailed in any decent book on testing computer software.  There are also some good references online.  Here is one from a fellow Expert Tester.  As with any tool, you the tester have a responsibility to know the benefits and weaknesses of the tools you use.

Thankfully, most compilers and coverage tools will produce these numbers for us. Code coverage goals at Microsoft used to be around 65% using automation.  For V1 of OneNote, I was able to drive the team and get it up to 72%.  Not bad for driving automation for a V1 project.  With the move from boxed products to services, code coverage is getting less attention, and we are now looking more into measuring feature and scenario coverage.  We’ll talk about that in a future blog.

Now, what will we tell The Powers That Be?

The key to unlock the tests is in the combination

In the last blog, Andrew Schiano discussed Equivalence Class (EQ) and Boundary Value Analysis (BVA) testing methodologies.  This blog will talk about how to extend those two ideas even further with Combinatorial Testing.

Combinatorial Testing is a form of model-based testing.  It chooses pairs or sets of inputs, out of all of the possible inputs, that give you the best coverage at the least cost.  Fewer test cases while still finding bugs and achieving high code coverage is every tester’s dream.  It is best applied when:

  • Parameters are directly interdependent
  • Parameters are semi-coupled
  • Parameter input is unordered

Let’s look at an example UI.  You have to test a character formatting dialog.  It allows you to pick between four fonts, two font styles, and three font effects.  A chart of the values looks like this:

Field     Values
Font      Arial, Calibri, Helvetica, BrushScript
Style     Bold, Italic
Effects   Strikethrough, Word Underline, Continuous Underline

For any selection of text, you can have only one Font, zero to two Styles, and zero to two Effects.  Notice that for Effects you can select one pair, strikethrough and continuous underline, but you should not be able to pick another pair, word underline and continuous underline.  You could set up a matrix of tests and check each combination.  That’s old school.  You could reduce the number of cases by applying EQ.  That would be smarter. Your best bet, though, is to apply Combinatorial Testing.

Combinatorial testing, also sometimes called Pairwise Testing, is just modeling the System Under Test (SUT).  You start by taking unique, interesting combinations of inputs, using them to test the system, and then feeding that information back into the model to help direct the next set of tests.  There is mathematics involved that takes into account the coupling (interactions) of the inputs, which we won’t cover here.  Thankfully, there are tools that help us choose which test cases to run.   I’ll cover the Pairwise Independent Combinatorial Testing (PICT) tool, available free from Microsoft.  Since it is a tool, it is only as good as the input you give it.

The steps to using PICT or any other combinatorial testing tool are:

  1. Analysis and feature decomposition
  2. Model parameter variables
  3. Input the parameters into the tool
  4. Run the tool
  5. Re-validate the output
  6. Modify the model
  7. Repeat steps 4-6

In our example above, the decomposition would look like this:

  • Font: Arial, Calibri, Helvetica, BrushScript
  • Bold: check, uncheck
  • Italic: check, uncheck
  • Strikethrough: check, uncheck
  • Underline: check, uncheck, word, continuous
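To see why this decomposition pays off, here is a naive greedy sketch in Python of what a pairwise tool does (PICT’s real algorithm is far more sophisticated; this is only an illustration):

```python
from itertools import combinations, product

# The decomposed parameters from the list above.
params = {
    "Font": ["Arial", "Calibri", "Helvetica", "BrushScript"],
    "Bold": ["check", "uncheck"],
    "Italic": ["check", "uncheck"],
    "Strikethrough": ["check", "uncheck"],
    "Underline": ["check", "uncheck", "word", "continuous"],
}
names = list(params)
all_rows = [dict(zip(names, vals)) for vals in product(*params.values())]
print(len(all_rows))  # 4*2*2*2*4 = 128 exhaustive combinations

def pairs(row):
    # every parameter-pair/value-pair this test case covers
    return {((a, row[a]), (b, row[b])) for a, b in combinations(names, 2)}

# Greedy cover: keep picking the test that covers the most uncovered pairs.
uncovered = set().union(*(pairs(r) for r in all_rows))
suite = []
while uncovered:
    best = max(all_rows, key=lambda r: len(pairs(r) & uncovered))
    suite.append(best)
    uncovered -= pairs(best)
print(len(suite))  # far fewer than 128 (but at least 16: Font x Underline alone has 16 pairs)
```

Every pair of values still appears in at least one test, yet the suite is a small fraction of the exhaustive 128.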

You feed this data into the tool and it will output the number of tests you specify.  You need to validate the cases before running them.  For example, the BrushScript font only allows Italic and Bold/Italic.  If the tool output the test case:

  • Font: BrushScript
  • Bold: check
  • Italic: uncheck
  • Strikethrough: check
  • Underline:  word

Being the awesome tester that you are, you would notice this is not valid.  Thankfully, the PICT tool allows you to constrain invalid input combinations.  It also allows you to alias equivalent values.  So, you modify the model, not the outputted values.  In this case, you would add two lines so the input file would now look something like this:

Font: Arial, Calibri, Helvetica, BrushScript

Bold: check, uncheck

Italic: check, uncheck

Strikethrough: check, uncheck

Underline: check, uncheck, word, continuous

IF [Font] = "BrushScript" AND [Italic] = "uncheck" THEN [Bold] <> "check";

IF [Bold] = "uncheck" AND [Italic] = "uncheck" THEN NOT [Font] = "BrushScript";
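In Python terms, the two constraints exclude combinations like the invalid case above (a sketch assuming the same parameter names and values):

```python
def satisfies_constraints(row):
    # IF [Font] = "BrushScript" AND [Italic] = "uncheck" THEN [Bold] <> "check";
    if row["Font"] == "BrushScript" and row["Italic"] == "uncheck" and row["Bold"] == "check":
        return False
    # IF [Bold] = "uncheck" AND [Italic] = "uncheck" THEN NOT [Font] = "BrushScript";
    if row["Bold"] == "uncheck" and row["Italic"] == "uncheck" and row["Font"] == "BrushScript":
        return False
    return True

# The invalid case spotted earlier: BrushScript without Italic.
bad = {"Font": "BrushScript", "Bold": "check", "Italic": "uncheck",
       "Strikethrough": "check", "Underline": "word"}
assert not satisfies_constraints(bad)

# Together, the constraints mean BrushScript always has Italic checked.
ok = dict(bad, Italic="check")
assert satisfies_constraints(ok)
```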

The PICT tool also allows you to weight values that are more common (who really uses BrushScript anymore?), and seed data about common historical failures.

Font: Arial (8), Calibri (10), Helvetica (5), BrushScript (1)

Does this really work?  Here is an example from a previous Microsoft project:

Command line program with six optional arguments:

Total Number of Blocks = 483

                        Default test suite   Exhaustive coverage   Pairwise coverage
Number of test cases    9                    972                   13
Blocks covered          358                  370                   370
Code coverage           74%                  77%                   77%
Functions not covered   15                   15                    15

Now, pair this with automation so you have data-driven automated testing, and you’re REALLY off and running as a twenty-first century tester!

A few words of caution.  While this gives you the minimum set of tests, you should also test:

  1. Default combinations
  2. Boundary conditions
  3. Values known to have caused bugs in the past
  4. Any mission-critical combinations.

Lastly, don’t forget Beizer’s Pesticide Paradox.  Keep producing new test cases.  If you only run the tool once, and continually run those same cases, you’re going to miss bugs.

There’s a Smart Monkey in my toolbelt

This is a follow-on to Andrew’s article “State-Transition Testing.”  If your software can be described using states, you can use monkey automation to test your product.  While smart monkeys can take a little time to implement, even using third-party automation software, their payback grows if you release multiple versions, follow a rapid release cadence, or are short on testing resources.

Let me start by noting that Smart Monkey testing differs from Netflix’s Simian Army.  They share a similar name, and both are code testing code, but otherwise they are different.

In general, monkey testing is automated testing with no specific purpose in mind other than to exercise the product or service.  It’s also known as “stochastic testing” and is a tool in your black-box tester tool belt.  Monkey testing comes in two flavors:

Dumb Monkey testing is when the automation randomly sends keystrokes or commands to the System Under Test (SUT).  The automation system or tester monitors the SUT for crashes or other incorrect reactions.  One example would be feeding random numbers into a dialog that accepts a dollar value.  Another example comes from my time as a tester in Office.  One feature I was testing was the ability to open web pages in Microsoft Word.  I wrote a simple macro in Visual Basic for Applications (VBA) to grab a random web page from the web browser, log the web address to a log file, and try to open the web page in Word.  If Word crashed, my macro would stop, and I would know I had hit a severity one (i.e. crashing) bug.  I could watch my computer test for me all day…or just run it at night.  Free bugs!
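A dollar-value dumb monkey like the first example might be sketched in Python (the parser here is a hypothetical system under test, not real product code):

```python
import random
import string

# Hypothetical system under test: parse a dollar amount typed into a dialog.
def parse_dollars(text):
    cleaned = text.strip().lstrip("$").replace(",", "")
    return round(float(cleaned), 2)

random.seed(1234)  # fixed seed so any crash is reproducible

def random_input(max_len=10):
    alphabet = string.digits + "$.,- " + string.ascii_letters
    return "".join(random.choice(alphabet) for _ in range(random.randint(0, max_len)))

crashes = 0
for _ in range(1000):
    text = random_input()
    try:
        parse_dollars(text)
    except ValueError:
        pass           # expected rejection of garbage input
    except Exception:  # anything else is a bug worth logging
        crashes += 1
```

Logging each input (as the Word macro logged each URL) is what turns a random walk into a reproducible bug report.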

Funny Aside: I had a little code in the macro to filter out, um, inappropriate web sites.  Occasionally one got through the code, an alarm would go off in our IT department, and my manager would receive an email that I was viewing sites against HR’s policy.  It happened every few months.  I would show that it was the automation, not me, doing the offensive act.  We would laugh about it, put a letter of reprimand in the monkey’s employee file, and move on.

Smart Monkey testing is, well, smarter.  Instead of doing random things, the automation system knows something about your program:

  • Where it is
  • What it can do there
  • Where it can go and how to get there
  • Where it’s been
  • If what it’s seeing is correct

This is where State Transition comes into play.

Let’s look at a State Transition diagram for a Learning Management System (LMS) and how the enrollment status of a student could change.

LMS State Diagram

You would define for the smart monkey automation the different states, which states changes are possible, and how to get to those.  For the diagram above, it might look something like this:

Where it is: Registered

What it can do there #1: Learner can Cancel their registration

What it can do there #2: Learner can Attend the class

Where it is: Canceled

What it can do there #1: Learner can Register for the class

What it can do there #2: Learner can be added to the Waitlist

Where it is: Waitlisted

What it can do there #1: Learner can Cancel their registration

What it can do there #2: The learner will be automatically registered by the system if a spot opens for them

You get the general idea.  This still looks like Andrew’s State-Transition Testing.  What is different is that the automation knows this information.  When you start the automation, it will start randomly traversing the states.  Sometimes it will follow an expected path:

Register | Attend

The next time, it might try a path you didn’t expect (the learner ignores their manager and attends anyway):

Request | Request Declined | Walk-in

It might also do things like request and cancel the same class fifty times.
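A minimal smart-monkey walker for these states might look like this in Python (the state names and transitions are assumptions read off the diagram, not the actual system):

```python
import random

# State model: each state maps to the states reachable from it.
transitions = {
    "Start":      ["Requested", "Registered"],
    "Requested":  ["Registered", "Declined"],
    "Declined":   ["Walked-in"],
    "Registered": ["Canceled", "Attended"],
    "Canceled":   ["Registered", "Waitlisted"],
    "Waitlisted": ["Canceled", "Registered"],
    "Attended":   [],   # terminal
    "Walked-in":  [],   # terminal
}

random.seed(7)  # log the seed so any failing walk can be replayed
state, path = "Start", ["Start"]
while transitions[state] and len(path) < 100:  # cap runaway walks
    state = random.choice(transitions[state])
    path.append(state)
    # a real monkey would drive the UI/API here and verify the new state
print(" | ".join(path))
```

Because the seed and path are logged, any surprising sequence the monkey stumbles into can be replayed exactly.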

Don’t have time to define all of the states?  Some third-party software will randomly explore your software, creating the state model and the ways to traverse it for you. You can then double-check the model it creates, make any corrections or additions, and you are on your way!

How can you improve this system?

  1. Have the monkey access an XML or other data file with specific cases you would like to hit more often than the random ones.  For example, you could use the PICT tool to create a list of the most interesting combinations of inputs to hit.
  2. Make the system smarter by assigning probabilities to each state change.  Find out how often users actually cancel a registration and feed that data into the smart monkey.  Now your monkey will exercise the state changes at the same frequency as real users.
  3. The next step?  Tie your monkey into live customer data: customer data-driven quality (CDDQ).  For example, say your customers suddenly start canceling a lot of class registrations due to an upcoming holiday.  Your smart monkey will automatically start testing the cancel-registration state change more often.
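Improvement #2 above — weighting state changes by real usage — is nearly a one-liner with Python’s `random.choices` (the 10% cancellation rate is made up for the sketch):

```python
import random

# Bias the monkey's transition choices with (made-up) production frequencies,
# so common user actions get exercised proportionally more often.
weighted_transitions = {
    "Registered": (["Attended", "Canceled"], [0.9, 0.1]),  # users rarely cancel
}

random.seed(42)
choices, weights = weighted_transitions["Registered"]
sample = random.choices(choices, weights=weights, k=1000)
cancel_rate = sample.count("Canceled") / 1000
print(cancel_rate)  # close to the observed 10% cancellation rate
```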

The whole idea of smart monkey testing is that it follows both expected and unexpected paths.  You can run the monkey on spare machines, on your test machine overnight, pretty much anytime, giving you free testing.  If your logging is good enough and tracks the state and the transition path followed, you will be able to reproduce any bugs it finds.  Watch your code coverage numbers go up…maybe.  But that’s fodder for another posting.

Long live the smart monkey!

Exploratory Testing == Fun Productivity

You are probably familiar with the testing approaches of black box, white box, and gray box testing.  Each “tool” in the tester’s tool belt can be used in the right circumstances, or misused in the wrong circumstances.  Exploratory Testing (ET) can be used in almost all circumstances, and whether done formally or informally, it is a tool we shouldn’t be afraid to use.

Exploratory testing (ET) is something you probably already do. It is more than just “clicking around” the product.  ET is defined as a test-execution approach where the tester uses information gained while performing tests to intuitively derive additional tests. You can think of it as that little voice in the back of your head telling you “Did I just see something that looked wrong? I better check that out more deeply.” This is subtly different from black-box (BB) testing where you apply tools like Boundary Value Analysis (BVA) and Equivalence Class (EQ) to first develop a list of tests, and second run those tests. It also differs from gray-box (GB) testing where you first use internal knowledge of the structure of the feature and code to develop a list of tests, and second run those tests. You can think of ET as BB and GB testing with a feedback loop—you do test design and test execution at the same time. You are free to explore other avenues of the product in order to track down bugs and issues.

Exploratory testing provides value to the testing effort. It is generally good at evaluating the “look and feel” of a project, but several studies raise important questions about the overall effectiveness and efficiency of behavioral testing and popular exploratory testing approaches to software testing. The details of the studies can be found in chapter six of How We Test Software at Microsoft.

ET can be explained with an analogy (from James Bach’s “Exploratory Testing Explained”):

Have you ever solved a jigsaw puzzle? If so, you have practiced exploratory testing. Consider what happens in the process. You pick up a piece and scan the jumble of unconnected pieces for one that goes with it. Each glance at a new piece is a test case (“Does this piece connect to that piece? No? How about if I turn it around? Well, it almost fits but now the picture doesn’t match…”). You may choose to perform your jigsaw testing process more rigorously, perhaps by concentrating on border pieces first, or on certain shapes, or on some attribute of the picture on the cover of the box. Still, can you imagine what it would be like to design and document all your jigsaw “test cases” before you began to assemble the puzzle, or before you knew anything about the kind of picture formed by the puzzle?

When I solve a jigsaw puzzle, I change how I work as I learn about the puzzle and see the picture form. If I notice a big blotch of color, I might decide to collect all the pieces of that approximate color into one pile. If I notice some pieces with a particularly distinctive shape, I might collect those together. If I work on one kind of testing for a while, I might switch to another kind just to keep my mind fresh. If I find I’ve got a big enough block of pieces assembled, I might move it into the frame of the puzzle to find where it connects with everything else. Sometimes I feel like I’m too disorganized, and when that happens, I can step back, analyze the situation, and adopt a more specific plan of attack. Notice how the process flows, and how it remains continuously, each moment, under the control of the practitioner. Isn’t this very much like the way you would assemble a jigsaw, too? If so, then perhaps you would agree that it would be absurd for us to carefully document these thought processes in advance. Reducing this activity to one of following explicit instructions would only slow down our work.

This is a general lesson about puzzles: the puzzle changes the puzzling. The specifics of the puzzle, as they emerge through the process of solving that puzzle, affect our tactics for solving it. This truth is at the heart of any exploratory investigation, be it for testing, development, or even scientific research or detective work.

Key advantages of ET:

  • Exploratory testing is heavily influenced by the tester’s in-depth system and domain knowledge and experience. The more you know, the better you are at following the paths that are most likely to find bugs or issues.
  • Less preparation is needed.
  • Important bugs are quickly found.
  • ET tends to be more intellectually stimulating than execution of scripted tests.
  • Even if you come back and test the same area again, you are likely to perform your tests in a slightly different way (you aren’t following a script), so you are more likely to find more bugs.
  • ET is particularly suitable if requirements and specifications are incomplete, or if there is a lack of time.
  • ET can also be used to validate that previous testing has found the most important defects.
  • ET often outperforms purely scripted testing at finding new bugs.

Key drawbacks of ET:

  • You must manage your time wisely. You need to know when to stop pursuing one avenue and move on to another.
  • You can’t review test cases in advance (and thereby prevent errors in code and test cases).
  • It can be hard to reproduce tests later unless you are documenting everything you do.  One idea is to use a screen recorder whenever you are doing ET.
  • It can be difficult to know exactly which tests have been run. This can be partially alleviated if you are recording your test steps and creating automation, or if you are tracking code coverage.
  • You may end up testing paths that a user would never take. You can use customer data to guide your ET so that you don’t spend time testing areas that don’t need to be tested.

Use ET when:

  • You need to provide feedback on a new product or feature.
  • You need to quickly learn a new product.
  • You have already been using scripts and seek to diversify the testing.
  • You want to find the single most important bug in the shortest time.
  • You want to check the work of another tester by doing a brief independent investigation.
  • You want to find and isolate a particular defect.
  • You want to determine the status of a particular risk, in order to evaluate the need of scripted tests in that area.
  • You are on a team practicing agile or Extreme Programming.

The last bullet deserves some context. Why would an agile team be interested in ET? Agile teams can suffer from groupthink. The team members spend all day working together, talking, coding, attending meetings, and so on. They tend to start thinking alike. While this helps the agile process, it can hinder testing. Why? Everyone starts to think about the product in the same way and use the product in the same way. Your scripted tests start following the same sequence as the developer’s code. ET can help break that groupthink, randomize the testing, and find issues that the customer would.

Are you a master or an amateur ET tester?

  • Test design: An exploratory tester is first and foremost a test designer. Anyone can design a test accidentally. The excellent exploratory tester is able to craft tests that systematically explore the product. This requires skills such as the ability to analyze a product, evaluate risk, use tools, and think critically, among others.
  • Careful observation: Excellent exploratory testers are more careful observers than novices, and for that matter, experienced scripted testers. The scripted tester will only observe what the script tells them to observe. The exploratory tester must watch for anything unusual or mysterious. Exploratory testers also must be careful to distinguish observation from inference, even under pressure, lest they allow preconceived assumptions to blind them to important tests or product behavior.
  • Critical thinking: Excellent exploratory testers are able to review and explain their logic, looking for errors in their own thinking. This is especially true when reporting the status of a session of exploratory tests investigating a defect.
  • Diverse ideas: Excellent exploratory testers produce more and better ideas than novices.  They may make use of heuristics to accomplish this. Heuristics are devices such as guidelines, generic checklists, mnemonics, or rules of thumb. The diversity of tester temperaments and backgrounds on a team can also be harnessed by savvy exploratory testers through the process of group brainstorming to produce better test ideas.
  • Rich resources: Excellent exploratory testers build a deep inventory of tools, information sources, test data, and friends to draw upon. While testing, they stay alert for opportunities to apply those resources to the testing at hand.

Exploratory testing can be valuable in specific situations and reveal certain categories of defects more readily than other approaches. The overall effectiveness of behavioral testing approaches is heavily influenced by the tester’s in-depth system and domain knowledge and experience. Of course, the effectiveness of any test method eventually plateaus or becomes less valuable and testers must employ different approaches to further investigate and evaluate the software under test (The Pesticide Paradox).