A quick coverage of Code Coverage

Testing is full of numbers:

  • How long will the test pass take?
  • What percentage of the features have you tested?
  • What is the automation test pass rate?
  • How confident are we that the failing tests are real product failures and not failures of the test system?
  • What is my raise going to be?

Code Coverage is just a number.  It tells us how much of the code has been exercised, and maybe verified, by our testing effort.  This is also sometimes called White Box testing since we look at the code in order to develop our test cases.  Management sometimes puts a high value on the code coverage number.  Whether they should or not is a discussion best left to each company.  There are multiple ways we can get code coverage numbers.  Here are three examples.

Block testing

Definition: Execute a contiguous block of code at least once

Block testing is the simplest first order method to obtain a code coverage number.  The strength is it’s quick.  The weakness is it’s not necessarily accurate.  Take a look at this code example:

bool IsInvalidTriangle(ushort a, ushort b, short c)
bool isInvalid;
if ((a + b <= c) || (b + c <= a) || (a + c <= b))
        isInvalid = true;
return isInvalid;

If we tested it with the values of a=1, b=2, and c=3; we would get a code coverage of about 80%.  Great, management says, SHIP IT!  Wait, you say, there is a weakness of block level testing.  Can you spot it?  The one test case only hits the first condition of the IF statement.  Block level testing will report the line as 100% covered, even though we did not verify the second and third conditions.  If one of the expressions was “<” instead of “<=” we would never catch the bug.

Condition testing

Definition: Make every sub-expression of a predicate statement evaluate to true and false at least once

This is one step better than block level testing since we validate each condition in a multiple condition statement.  The trick is to break any statement with multiple conditions to one condition per line, and then put a letter in front of each condition.  Here is an example:

void check_grammar_if_needed(const Buffer& buffer)
A:  if (enabled &&
B:      (buffer.cursor.line < 10) &&
C:      !buffer.is_read_only)

Our tests would be:

Test  enabled    value of ‘line’   is_read_only   Comment
1 False  N/A  N/A
2 True 11  N/A A is   now covered
3 True 9 True B is   now covered
4 True 9 False C is   now covered

Breaking the conditions into one per line doesn’t really help much here.  This trick will help if you have nested loops.  You can set up a table to help make sure each inner expression condition is tested with each outer expression condition.

Basis Path testing

Definition: Test C different entry-exit paths where C (Cyclomatic complexity) = number of conditional expressions + 1

Does the term “Cyclomatic complexity” bring back nightmares of college?  Most methods have one entry and one or two exits.  Basis Path testing is best applied when there are multiple exit points since you look at each exit path in order to determine your code coverage.  The steps you follow to find the basis paths (shortest path method):

  • Find shortest path from entry to exit
  • Return to algorithm entry point
  • Change next conditional expression or sub-expression to alternate outcome
  • Follow shortest path to exit point
  • Repeat until all basis paths defined

Here is an example:

A:  static int GetMaxDay(int month, int year)
    int maxDay = 0;
B:       if (IsValidDate(month, 1, year))    {
C:         if (IsThirtyOneDayMonth(month))     {
    maxDay = 31;
D:      else if (IsThirtyDayMonth(month))    {
    maxDay = 30;
    else    {
    maxDay = 28;
E:          if (IsLeapYear(year))    {
    maxDay = 29;
    return maxDay;
F:       }

Test cases:

Branch to flip  Shortest path out        Path Input
n/a B==false ABF 0, 0
B B==true,   C==true ABCF 1,1980
C B==true,   C==false, D==true ABCDF 4,1980
D B==true,   C==false, D==false, E==false ABCDEF 2,1981
E B==true,   C==false, D==false, E==true ABCDEF 2,1980

These are just three of the many different ways to calculate code coverage.  You can find these and more detailed in any decent book on testing computer software.  There are also some good references online.  Here is one from a fellow Expert Tester.  As with any tool, you the tester have a responsibility to know the benefits and weaknesses of the tools you use.

Thankfully, most compilers will produce these numbers for us. Code Coverage goals at Microsoft used to be around 65% code coverage using automation.  For V1 of OneNote, I was able to drive the team and get it up to 72%.  Not bad for driving automation for a V1 project.  With the move from boxed products to services, code coverage is getting less attention and we are now looking more into measuring feature and scenario coverage.  We’ll talk about that in a future blog.

Now, what will we tell The Powers That Be?

Testers Caught Sleeping on the Job

At Microsoft, we submit every code change to a peer-review before it’s checked in. I’ve performed hundreds (although it feels like millions) of code reviews in the six years I’ve been in Test. One of my biggest red flags during a code review is the Sleep statement.

The Sleep command suspends the current thread for a specified period. A typical test that uses Sleep might look like this:

  // Send a message 

  // Wait 2 minutes for message to be received.

  // Message must be there. Let's read it.

The problem with Sleep statements is that they usually sleep either too long or not long enough. If the sleep is too short, your test will fail because the expected state hasn’t been reached. In the above example, if it takes three minutes to receive the message, the test will fail since it’ll try to read a message that isn’t there yet.

If the sleep is too long, your test might still fail; the expected state may have come and gone. But even if it passes, the test isn’t as efficient as it could be. This could lead to test passes that take a long time to complete. If this message was received in one minute, the test will take a minute longer than required. One minute might not sound like a lot, but if you have thirty similar tests, you’re wasting a half hour of run time.

When you find a long Sleep statement inside a loop, there could be room for a significant performance improvement. During one code review, I found a test that looked like the following; this one loop wasted more than a half hour of run time.

  // Send a bunch of messages 
  for (int i=1; i<30; i++) 
     // Send one message 

     // Wait 2 minutes for message to be received. 

     // Read the message.   

Long Sleep statements are also a common cause of tests that fail intermittently. The examples we’ve been using assume we’ll receive every message within two minutes. Even if that’s the case 99% of the time, if the test runs every day it’ll still fail every few months. Furthermore, these tests aren’t very portable. If you run them on a faster or slower machine, or on a server with a different load, the tests might fail. Each of these false positives must then be investigated, wasting your semi-precious time.

When you find a Sleep that lasts more than a few seconds, the best solution is to remove the Sleep and subscribe to an event that’s raised when the expected state occurs. This removes all the guessing out of your test case, and ensures you’re not waiting too long or too short. Such an event might not exist, in which case you may be able to write one yourself, or ask the product developer to create one.

If an event-based solution is impractical to use, all hope is not lost. In that case, replace the long Sleep with logic that polls until either the expected event occurs or until a timeout period is reached. In test above, we saved thirty minutes of run time by replacing the two-minute Sleep with the following:

  private static bool PollForMessage() 
     DateTime timeout = DateTime.Now.AddMinutes(2); 
     bool gotIt = MessageExists(); 

     // Loop until message received or two minutes has passed 
     while (!gotIt && DateTime.Now < timeout) 
        gotIt = MessageExists(); 
     return gotIt; 

You may be wondering why I’m whining about Sleep statements, but then went ahead and used one in this solution. That’s because, at most, this routine will only sleep one second longer than it should, which isn’t too shabby. If you find the message isn’t always received within two minutes, you can safely go ahead and change the timeout to three minutes without wasting any run time.

The next time you find yourself writing a long Sleep statement, replace it with either an event, or a function that polls until your expected state occurs. If you do this consistently, your test passes will finish faster with more consistent results.

P.S. After I wrote this article, I found that BJ Rollison recently wrote about the same topic, only much more eloquently, on his own blog. See for yourself.

TestApi… A Forgotten Soldier in the Fight Against Bugs.

Bug Fighters ala Starship Troopers

I was talking to my team today about doing globalization testing as a part of normal tests in our UI. In this conversation we were discussing ways of generating random strings using different unicode chars. I mentioned that there was a Microsoft created library for use in many different areas of testing, one of which is for string generation. However not many testers, even in Microsoft, know about or use this handy library. I’d previously used the string generation portion to do some fuzz testing in a different project, because it was much easier to use than most fuzz tools normally suggested for this purpose.

I decided it might be a good idea to disseminate this information so that it gets more use amongst our test warriors. The following is the basic information to get you started.

Here is the info on the TestApi library that Microsoft made and has many different uses:

§ Overview of TestApi

§ Part 1: Input Injection APIs

§ Part 2: Command-Line Parsing APIs

§ Part 3: Visual Verification APIs

§ Part 4: Combinatorial Variation Generation APIs

§ Part 5: Managed Code Fault Injection APIs

§ Part 6: Text String Generation APIs

§ Part 7: Memory Leak Detection APIs

§ Part 8: Object Comparison APIs

Here is an example from the String generation section that allows you to generate random strings for testing.

  // Generate a Cyrillic string with a length between 10 and 30 characters.

  StringProperties properties = new StringProperties();
  properties.MinNumberOfCodePoints = 10;
  properties.MaxNumberOfCodePoints = 30;
  properties.UnicodeRanges.Add(new UnicodeRange(UnicodeChart.Cyrillic));

  string s = StringFactory.GenerateRandomString(properties, 1234);

  The generated string may look as follows:
  s: Ӥёӱіӱӎ҄ҤяѪӝӱѶҾүҕГ


Look to the Data

I was recently asked to investigate my team’s automated daily test cases. It was taking more than 24 hours to execute our “daily” test run. My job was to find why the tests were taking so long to complete, and speed them up when possible. In the process, an important lesson was reinforced: we should look to real, live customer data to guide our test planning.

I had several ideas to speed up our test passes. My highest priority was to find and address the test cases that took the longest to run. I sorted the tests by run-time, and discovered the slowest test case took over an hour to run. I tackled this first, as it would give me the biggest “bang for the buck”.

The test validated a function that took two input parameters. It iterated through each possible input combination, verifying the result. The code looked like this:

for setting1 = 1 to 5
  for setting2 = 0 to 5
    validate(setting1, setting2);

The validate function took two minutes to execute. Since it was called thirty times, the test case took an hour to complete. I first tried to improve the performance of the validate function, but the best I could do was shave a few seconds off its run-time.

My next thought was whether we really needed to test all 30 input combinations. I requested access to the live production data for these fields. I found just two setting combinations accounted for 97% of the two million production scenarios; four combinations accounted for almost 99%. Many of the combinations we were testing never occurred at all “in the real world.”

Most Common Data Combinations

setting1 setting2 % of Production Data
5 5 73%
1 5 24%
1 0 .9%
2 1 .8%
2 5 .7%
3 5 .21%
1 1 .2%
3 0 .1%
4 0 .09%

I reasoned that whatever change I made to this test case, the two most common combinations must always be tested. Executing just these two would test almost all the production scenarios, and the run-time would be cut to only four minutes. But notice that the top two combinations both have a setting2 value of 5. Validating just these two combinations would leave a huge hole in our test coverage.

I considered testing only the nine combinations found in production. This would guarantee we tested all the two million production scenarios, and our run-time would be cut from one hour to 18 minutes. The problem with this solution is that if a new combination occurred in production in the future, we wouldn’t have a test case for it; and if there was a bug, we wouldn’t know about it until it was too late.

Another possible strategy would be to split this test into multiple test cases, and use priorities to run the more common scenarios more often. For example, the two most common scenarios could be classified as P1, the next seven P2, and the rest P3. We would then configure our test passes to execute the P1s daily, the P2s weekly, and the P3s monthly.

The solution I decided on, however, was to keep it as one test case and always validate the two most common combinations as well as three other, randomly generated, combinations. This solution guaranteed we test at least 97% of the live production scenarios each night. All thirty combinations would be tested, on average, every 10 days–more often than the “priority” strategy I had considered. The solution reduced the test’s run-time from over an hour to about 10 minutes. The entire project was a success as well; we reduced the test pass run-time from 27 hours to less than 8 hours.

You may be thinking, I could have used pairwise testing to reduce the number of combinations I executed. Unfortunately, pairwise doesn’t help when you only have two input parameters. This strategy ensures each pair of parameters are validated, so the total number of combinations tested would have remained at thirty.

In hindsight, I think I could have been smarter about the random combinations I tested. For example, I could have ensured I generated at least one test case for each possible value of setting1 and setting2.

I also could have associated a “weight” with each combination based on how often it occurred in production. Settings would still be randomly chosen, but the most common ones would have a better chance of being generated. I would just have to be careful to assign some weight to those combinations that never appear in production; this would make sure all combinations are eventually tested. I think I’ll use this strategy the next time I run into a similar situation.

%d bloggers like this: