Why 100% Test Pass Rates Are Bad

Are you shooting for the wrong goal?

When your test pass rate reaches 100%, it’s usually cause for celebration. It shows your product has high quality. The tests you so carefully designed are passing, proving the product works as expected. It also shows your test cases have high quality; the results are repeatable and they’re successfully validating product functionality. This is the goal you’ve been striving for since you started testing, right? If that’s the case, you’ve been shooting for the wrong goal.

A 100% pass rate implies your product works exactly as expected. The goal of testing, however, isn’t to prove the software works as expected. In fact, this objective can never even be achieved, because all complex software has defects. What sense does it make to have a goal that can’t be achieved? Your objective should be to write tests that expose these defects so an informed decision can be made on whether to fix them.

The other reason a 100% pass rate is the wrong goal is because when all your tests are consistently passing, they stop providing you new information. You see a perfect pass rate every morning, and pay no further attention to the tests. In other words, once you reach your “goal”, your tests stop helping you improve your product.

To be fair, from a management perspective a 100% pass rate is the right goal. It gives managers confidence in the quality of the product. But it can’t give them that confidence if both the managers and testers are striving for a perfect pass rate. In that case, no one is looking for bugs and the 100% pass rate means little.

Now assume you have a perfect pass rate, but the testers are striving to prove the software doesn’t work as expected. This shows management that despite the fact you’re looking for bugs, you still can’t find any. That will give management confidence in the software quality.

If you have a 100% pass rate, and still have time before Test Complete, don’t think of it as the end game. There are bugs in your product, and your tests aren’t finding them. Modify your tests. Write new ones. But design every test case with the goal of proving the software does NOT work as expected. The advice may sound obvious, but some testers don’t think this way; they’re much more interested in having their test passes go smoothly and on schedule.
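For example, rather than only confirming that valid input succeeds, feed the code the input it should reject. Here’s a minimal sketch in Python with pytest; the order_service module and its parse_quantity function are hypothetical stand-ins for whatever you’re actually testing:

import pytest

# Hypothetical function under test: parse_quantity("3") should return 3
# and reject anything that isn't a positive integer.
from order_service import parse_quantity

# A happy-path test only confirms what we already expect to pass.
def test_accepts_valid_quantity():
    assert parse_quantity("3") == 3

# Tests written to prove the code does NOT work probe the inputs most
# likely to expose a defect: boundaries, junk, and empty values.
@pytest.mark.parametrize("bad_input", ["0", "-1", "", "   ", "3.5", "NaN", None, "9" * 40])
def test_rejects_invalid_quantity(bad_input):
    with pytest.raises(ValueError):
        parse_quantity(bad_input)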

Edsger Dijkstra wrote, “Testing can show the presence of errors, but not their absence.” When you change your mindset and write tests to prove the product doesn’t work, rather than to prove that it does work, you’ll be surprised how many more bugs you expose. This simple change will make you a better tester, and your products will be better for it.

When Can Testing End?

Last year a program manager sent me an email I still think about. The product we were working on had just been released to our dogfood environment, where it would be used by employees before being released to customers.

The next day we were still completing the analysis of our automated test failures. Our initial investigation determined that these failures shouldn’t hold up the release; they either didn’t meet the bug bar, or we thought they were caused by configuration issues in our test environment. But as I continued investigating, I discovered one of the failures actually was caused by a product bug that met the bar. Later that day, I found a second product bug. Shortly after logging the second bug I received this email.

When is testing supposed to end?

My instinct was that testing should never end. At the time, our product had only been released to dogfood, not production. There was no reason to stop testing since we could still fix critical bugs before they affected customers. And even if a bug was found after shipping to customers, it’s often possible to release an update to fix the problem.

But the more I thought about it, the more I realized it’s a perfectly legitimate question. Is there ever a time when testing can stop? If you’re shipping a “shrink-wrapped” product, then the answer is “yes”. If your product is a service, the answer is more complicated.

The minimum conditions required to stop testing are:

  1. Development for the feature is complete (no more code churn).
  2. The release criteria for the feature have been met.
  3. All failing tests and code coverage gaps have been investigated.

The first condition is that development has stopped and the code base is stable. As long as product code is changing, there’s a possibility new bugs are being introduced. The new code changes should be tested and regression tests executed.

Once the product code is stable, the release criteria must be met. The release criteria are typically agreed upon early in the test planning stage by both the Test and Dev teams, and can include conditions such as:

  • No open P1 (severe) bugs
  • 100% of tests executed
  • 95% test pass rate
  • 90% code coverage

You might think that once development is complete and your release criteria are met, it’s safe to stop testing; this is not the case. In the example above, the release criteria included a 95% automation pass rate. If your actual pass rate is anything less than 100%, however, it’s imperative that you investigate every failure. If a failure was caused by a bug that meets the bar, the bug fix will violate the first condition that the code base remains stable, so testing must continue.

The same holds true for code coverage goals. Unless you have perfect code coverage, you need to address any gaps. This can be done either by creating new test cases, or by accepting the risk that these specific areas of code remain untested.
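To make the three conditions concrete, here’s a minimal sketch in Python of a stop-testing check; the thresholds mirror the sample release criteria above, and every field name is hypothetical:

from dataclasses import dataclass

@dataclass
class ReleaseMetrics:
    open_p1_bugs: int              # severe bugs still open
    tests_executed_pct: float      # percentage of planned tests executed
    pass_rate_pct: float           # percentage of executed tests that passed
    code_coverage_pct: float       # code coverage achieved
    uninvestigated_failures: int   # failing tests not yet analyzed
    unresolved_coverage_gaps: int  # coverage gaps not yet accepted or closed
    code_churn_open: bool          # True while product code is still changing

def can_stop_testing(m: ReleaseMetrics) -> bool:
    """True only when all three stop conditions are satisfied."""
    criteria_met = (m.open_p1_bugs == 0
                    and m.tests_executed_pct >= 100.0
                    and m.pass_rate_pct >= 95.0
                    and m.code_coverage_pct >= 90.0)
    return (not m.code_churn_open                 # 1. development is complete
            and criteria_met                      # 2. release criteria are met
            and m.uninvestigated_failures == 0    # 3. failures investigated...
            and m.unresolved_coverage_gaps == 0)  #    ...and coverage gaps addressed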

Once our three conditions are met, you can stop testing–if your product will be shrink-wrapped and shipped on a DVD. This doesn’t mean your product is bug-free; all relatively complex software has some bugs. But that’s why you created your release criteria—to manage the risk inherent in shipping software, not to eliminate it. If your minimum test pass rate or code coverage rate is anything less than 100%, you’re choosing to accept some reasonable amount of risk.

If your product is a service then your job isn’t done yet. You don’t need to continue running existing automation in test environments because the results will always be the same. But even though your product code isn’t changing, there are other factors that can affect your service. A password can expire, or a wrong value could be entered in a configuration file. There could be hardware failures or hacking attempts. The load may be more than you planned for. So even though traditional testing may have ended, you should now shift your efforts to monitoring the production environment and testing in production.

Who Tests the Watchers?

Back in February, in his blog titled “Monitoring your service platform – What and how much to monitor and alert?”, Prakash discussed the monitoring of a service running in the cloud, the multitude of alerts that could be set up, and how testers need to trigger and test the alerts. Toward the end he said:

Once it’s successful to simulate [the alerts], these tests results will provide confidence to the operations engineering team that they will be able to handle and manage them quickly and effectively.

When I read this part of his posting, one thing jumped into my Tester’s mind: Who is verifying that there is a working process in place to deal with the alerts once they start coming through from Production?  If you have an Operations team, do they already have a system in place that you can just ‘plug’ your alerts into?  If you don’t have a system in place, are they responsible for developing the system?

If you, the tester, have any doubt about the alert handling process, here are a few questions you might want to ask. Think of it as a test plan for the process.

  • Is the process documented someplace accessible to everyone it affects?
  • Is there a CRM (Customer Relationship Management) or other tracking system for the alerts?
  • Is a system like SCOM (System Center Operations Manager) being used that can automatically generate a ticket for the alert, and handle some of the alerts without human intervention?
  • Is there an SLA (Service Level Agreement) or OLA (Operations level Agreement) detailing turn-around time for each level of alert?
  • Who is on the hook for the first level of investigation?
  • What is the escalation path if the Operations team can’t debug the issue?  Is it the original product team?  Is there a separate sustainment team?
  • Who codes, tests, and deploys any code fix?

You might want to take an alert from the test system and inject it into the proposed production system, appropriately marked as a test, of course.  Does the system work as expected?
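If the Operations team agrees to that kind of dry run, the injection itself can be as simple as posting a synthetic alert, clearly flagged as a test, to whatever endpoint the production alerting pipeline consumes. A rough sketch in Python; the URL and payload fields below are made up for illustration:

import json
import urllib.request

# Hypothetical ingestion endpoint; substitute whatever your monitoring
# or ticketing pipeline actually accepts.
ALERT_ENDPOINT = "https://alerts.example.com/api/ingest"

test_alert = {
    "source": "test-team",
    "severity": "P2",
    "message": "Synthetic alert injected to verify the alert-handling process",
    "is_test": True,   # clearly marked so on-call engineers aren't paged for real
}

request = urllib.request.Request(
    ALERT_ENDPOINT,
    data=json.dumps(test_alert).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Then confirm a ticket was created and routed within the agreed SLA.
with urllib.request.urlopen(request) as response:
    print("Alert ingestion returned HTTP", response.status)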

More information on monitoring and alerts can be found in the Microsoft Operations Framework.

Look to the Data

I was recently asked to investigate my team’s automated daily test cases. It was taking more than 24 hours to execute our “daily” test run. My job was to find why the tests were taking so long to complete, and speed them up when possible. In the process, an important lesson was reinforced: we should look to real, live customer data to guide our test planning.

I had several ideas to speed up our test passes. My highest priority was to find and address the test cases that took the longest to run. I sorted the tests by run-time, and discovered the slowest test case took over an hour to run. I tackled this first, as it would give me the biggest “bang for the buck”.

The test validated a function that took two input parameters. It iterated through each possible input combination, verifying the result. The code looked like this:

# 5 values of setting1 times 6 values of setting2 = 30 combinations
for setting1 in range(1, 6):        # setting1: 1 through 5
    for setting2 in range(0, 6):    # setting2: 0 through 5
        validate(setting1, setting2)

The validate function took two minutes to execute. Since it was called thirty times, the test case took an hour to complete. I first tried to improve the performance of the validate function, but the best I could do was shave a few seconds off its run-time.

My next thought was whether we really needed to test all 30 input combinations. I requested access to the live production data for these fields. I found just two setting combinations accounted for 97% of the two million production scenarios; four combinations accounted for almost 99%. Many of the combinations we were testing never occurred at all “in the real world.”

Most Common Data Combinations

setting1    setting2    % of Production Data
    5           5            73%
    1           5            24%
    1           0             0.9%
    2           1             0.8%
    2           5             0.7%
    3           5             0.21%
    1           1             0.2%
    3           0             0.1%
    4           0             0.09%
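A distribution like this is easy to produce once you have the raw data. A quick sketch in Python, assuming the production records have already been exported and carry the setting1 and setting2 fields shown above:

from collections import Counter

def combination_frequencies(records):
    """Return (setting1, setting2) pairs with their share of production traffic."""
    counts = Counter((r["setting1"], r["setting2"]) for r in records)
    total = sum(counts.values())
    # Most common combinations first, each with its fraction of all scenarios
    return [(combo, count / total) for combo, count in counts.most_common()]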

I reasoned that whatever change I made to this test case, the two most common combinations must always be tested. Executing just these two would test almost all the production scenarios, and the run-time would be cut to only four minutes. But notice that the top two combinations both have a setting2 value of 5. Validating just these two combinations would leave a huge hole in our test coverage.

I considered testing only the nine combinations found in production. This would guarantee we tested all the two million production scenarios, and our run-time would be cut from one hour to 18 minutes. The problem with this solution is that if a new combination occurred in production in the future, we wouldn’t have a test case for it; and if there was a bug, we wouldn’t know about it until it was too late.

Another possible strategy would be to split this test into multiple test cases, and use priorities to run the more common scenarios more often. For example, the two most common scenarios could be classified as P1, the next seven P2, and the rest P3. We would then configure our test passes to execute the P1s daily, the P2s weekly, and the P3s monthly.

The solution I decided on, however, was to keep it as one test case and always validate the two most common combinations as well as three other, randomly generated, combinations. This solution guaranteed we test at least 97% of the live production scenarios each night. All thirty combinations would be tested, on average, every 10 days–more often than the “priority” strategy I had considered. The solution reduced the test’s run-time from over an hour to about 10 minutes. The entire project was a success as well; we reduced the test pass run-time from 27 hours to less than 8 hours.
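Here’s roughly what that selection logic looks like in Python; the validate call is a placeholder for the real two-minute check described earlier:

import random

def validate(setting1, setting2):
    """Placeholder for the real validation, which takes about two minutes."""
    ...

ALWAYS_TESTED = [(5, 5), (1, 5)]   # the two combinations covering ~97% of production

# The remaining 28 of the 30 possible combinations
ALL_COMBINATIONS = [(s1, s2) for s1 in range(1, 6) for s2 in range(0, 6)]
REMAINING = [c for c in ALL_COMBINATIONS if c not in ALWAYS_TESTED]

def nightly_combinations():
    """Two most common combinations every night, plus three picked at random."""
    return ALWAYS_TESTED + random.sample(REMAINING, 3)

for setting1, setting2 in nightly_combinations():
    validate(setting1, setting2)   # five calls at ~2 minutes each: ~10 minutes total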

You may be thinking I could have used pairwise testing to reduce the number of combinations I executed. Unfortunately, pairwise doesn’t help when you only have two input parameters. The technique guarantees that every pair of parameter values is covered, and with only two parameters every combination is itself a pair, so the total number of combinations tested would have remained at thirty.

In hindsight, I think I could have been smarter about the random combinations I tested. For example, I could have ensured I generated at least one test case for each possible value of setting1 and setting2.

I also could have associated a “weight” with each combination based on how often it occurred in production. Settings would still be randomly chosen, but the most common ones would have a better chance of being generated. I would just have to be careful to assign some weight to those combinations that never appear in production; this would make sure all combinations are eventually tested. I think I’ll use this strategy the next time I run into a similar situation.
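A sketch of that weighting idea using Python’s random.choices; the weights below come from the production percentages in the table, plus a small floor so combinations that never appear in production are still picked eventually:

import random

# Production share for each observed combination, plus a small baseline
# weight so the unseen combinations still get tested over time.
PRODUCTION_SHARE = {(5, 5): 73.0, (1, 5): 24.0, (1, 0): 0.9, (2, 1): 0.8, (2, 5): 0.7,
                    (3, 5): 0.21, (1, 1): 0.2, (3, 0): 0.1, (4, 0): 0.09}
BASELINE_WEIGHT = 0.05

ALL_COMBINATIONS = [(s1, s2) for s1 in range(1, 6) for s2 in range(0, 6)]
WEIGHTS = [PRODUCTION_SHARE.get(c, 0.0) + BASELINE_WEIGHT for c in ALL_COMBINATIONS]

def weighted_nightly_sample(k=5):
    """Pick k distinct combinations, biased toward the most common ones."""
    chosen = set()
    while len(chosen) < k:
        # random.choices samples with replacement, so collect until k are distinct
        chosen.add(random.choices(ALL_COMBINATIONS, weights=WEIGHTS, k=1)[0])
    return sorted(chosen)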

The Tester’s Dilemma: To Pad or Not to Pad

My first self-evaluation as a tester included statements like, “I added 50 new tests to this test suite” and “I automated 25 tests in that suite.” I thought the more tests I wrote, the more productive I was. I was wrong, and so are many testers who still feel that way. But it isn’t all our fault.

Early in my career I wrote a ton of tests, each validating one thing and one thing only. The main benefit of this strategy is that if a test failed, I knew exactly what failed with minimal investigation. An unexpected side effect was that it led me to write a lot of tests. And this, I thought, was an accurate indicator of how productive I was.

I later discovered this strategy also had undesirable side effects. I discussed these side effects in an earlier article to much fanfare, so I won’t go into the details again. But the four disadvantages I see are:

  • Test passes take too long to complete
  • Results take too long to investigate
  • Code takes too much effort to maintain
  • Above a certain threshold, additional tests can mask product bugs

After six years in Test, it’s now obvious to me that it’s not necessarily better to write a lot of tests. I would now rather write one test that finds bugs, than a hundred that don’t. I would rather write one really efficient test that validates a complete scenario, than ten crappy ones that each validate only part of a scenario.

Yet even if it’s better to write fewer, more effective tests, not all testers have the incentive to do so. Are you confident your manager will know you’re working hard and doing a good job if you only have a handful of tests to show for your effort? Not all testers are.

I’m at the point in my career where I’m happy to say I have this confidence because my managers are familiar with the quality of my work. Some less experienced testers, however, face a dilemma: It’s better for their product to have fewer, more efficient tests; but it might be better for their career to write more, less efficient ones.

To be fair, never in my career have I been told that I’m doing a good job because I wrote a lot of tests, or, conversely, doing a bad job because I wrote too few. But sometimes the pressure was less direct.

I worked on one project where at the end of the kick-off meeting I was asked how long it would take to design and automate all of my tests. It was the first day of the project and I had no idea how many tests would be needed, so I asked for time to analyze the functional specifications. I was told we needed to quickly make a schedule and I should give my estimate based on fifty tests.

I had two issues with this question. First, why fifty? I’ll assume it was because fifty sounded like a reasonable number of tests that would help put something in the schedule. The schedule might be changed later, but it would be a good estimate to start with. (In hindsight, it wasn’t a very good estimate, as we actually wrote twice that many tests.)

My bigger problem was that this was a loaded question. I was now under pressure, subtle as it might be, to come up with close to fifty tests. What if I had then analyzed the specs and found that I could test the feature with just five efficient tests? Considering I had given an estimate based on fifty, would this have been viewed as really efficient testing, or really superficial testing?

To solve the tester’s dilemma we need to remove any incentive to pad our test count. We can do this by making sure our teams don’t use “test count” as a quality metric. Our goal should be quality, not quantity; and test count is not a good metric for either test quality or tester quality.

Luckily I’ve never worked on a team that used “test count” as a metric, but I know of teams that do. I also know of teams that use a similar metric: “bug count”. One tester I know spent most of his time developing automation, and yet was “dinged” for not logging as many bugs as the manual testers on the team. Much like “test count”, the number of bugs logged is not as important as the quality of those bugs. We should look at any metric that ends in the word “count” with skepticism.

We also need to keep an eye out for the more subtle forms of pressure to pad our test count. For example, hearing any of the following makes me leery:

  • Automate fifty tests.
  • Design ten Build Verification Tests (BVTs).
  • Write a test plan four to five pages long.
  • 10% of your test cases should be classified as P1 (highest-priority).

All of these statements frame the number of tests we’re expected to create. While they’re fine as guidelines, they may also tempt you to add a few extra tests to reach fifty, or classify a couple of P2s as P1. And that can’t be good for your product or your customers.
