Performance Testing 101

Hi all. In this post I’ll go over the general approach my team uses when planning and executing performance tests in the lab.

Step 1: define the questions we’d like to answer

I view software testing as answering questions. At the beginning of a test effort we document the questions we would like to answer, and then we spend the rest of the milestone answering them.

Questions for performance generally fall into three categories:

  • Resource utilization — how much CPU, disk, memory, and network does a system use?
  • Throughput — how many operations per second can a system handle?
  • Latency — how long does it take one operation to complete?

Here are some examples:

  • How many operations per second can a server handle?
  • How many concurrent clients can a server handle?
  • If a server handles load for 2 weeks straight, does throughput or latency degrade? Do we leak memory?
  • If we keep adding new customer accounts to a system, at what point will the system fall over? Which component will fall over first?
  • When a user opens a file that is 1 GB in size, how long does it take to open? How much disk activity occurs during this process?
  • When a process connects to a remote server, how much network bandwidth is used?

We spend a lot of time thinking about the questions, because these questions guide the rest of the process.

Step 2: define the performance tests

The next step is to define the performance tests that will help us answer our questions. For each performance test we identify two things: (1) the expected load and (2) the key performance indicators (KPIs).

Load is the set of operations that are expected to occur in the system. All of these operations compete for resources and affect the throughput and latency. Usually a system will have multiple types of load all occurring at the same time, and thus we try to simulate all of these types of load in a performance test.

A mistake I’ve made in the past is to not identify all types of important load. Sometimes I’ve focused too closely on one type and forgotten that there were other operations in the system that affected performance. The lesson I’ve learned: don’t test in a vacuum.

The second part of a performance test is the key performance indicators (KPIs). These are the variables we want to measure, along with the goals for each variable. We always gather data for system resources (CPU, disk, memory, and network). We also gather data for application-specific KPIs in latency and throughput.
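To make the two parts concrete, here is a minimal Python sketch of how a test definition could be written down: a list of load types plus a list of KPIs, each with a goal. Everything in it (the class names, the KPI names, and the goal values) is hypothetical and chosen only for illustration; it is not the format my team uses.

```python
from dataclasses import dataclass, field


@dataclass
class Kpi:
    """One variable to measure, plus the goal it is judged against."""
    name: str
    goal: float
    higher_is_better: bool  # True for throughput, False for latency or CPU

    def meets_goal(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.goal
        return measured <= self.goal


@dataclass
class PerformanceTest:
    """A performance test: the expected load plus the KPIs to collect."""
    name: str
    load_types: list
    kpis: list = field(default_factory=list)

    def evaluate(self, measurements: dict) -> dict:
        """Return pass/fail for each KPI that has a measurement."""
        return {
            kpi.name: kpi.meets_goal(measurements[kpi.name])
            for kpi in self.kpis
            if kpi.name in measurements
        }


# Hypothetical example values, not taken from any real system.
smtp_test = PerformanceTest(
    name="smtp-ingest",
    load_types=["SMTP email messages", "SQL transactions"],
    kpis=[
        Kpi("throughput_msgs_per_s", goal=200.0, higher_is_better=True),
        Kpi("latency_ms", goal=50.0, higher_is_better=False),
        Kpi("cpu_percent", goal=80.0, higher_is_better=False),
    ],
)
```

Listing more than one load type in the definition is a small guard against testing in a vacuum: the other operations are written down before any automation starts.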

Step 3: automate and execute the tests

Now that the plans are complete, we focus on automation and execution. For each performance test we automate the load generation and data (KPI) collection.

Load generators are written in C#. With each load generator we attempt to mimic the expected load in the system. For example, if the load is SMTP email messages, we’ll write a generator that implements the client side of an SMTP session. If the load is SQL transactions, we’ll write functions that simulate these transactions.
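The general shape of a load generator can be sketched in a few lines. Our generators are written in C#; the sketch below uses Python only to keep it short, and `fake_operation` is a hypothetical stand-in for one real client operation such as an SMTP session. It runs the operation from several threads and reports throughput and latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def timed_call(operation: Callable[[], None]) -> float:
    """Run one operation and return its latency in seconds."""
    start = time.perf_counter()
    operation()
    return time.perf_counter() - start


def generate_load(operation: Callable[[], None],
                  total_operations: int,
                  concurrency: int) -> dict:
    """Run `operation` total_operations times across `concurrency` threads."""
    if total_operations < 1 or concurrency < 1:
        raise ValueError("total_operations and concurrency must be >= 1")
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, [operation] * total_operations))
    elapsed = time.perf_counter() - wall_start
    latencies.sort()
    return {
        "operations": len(latencies),
        "throughput_ops_per_s": len(latencies) / elapsed,
        # Upper median when the number of operations is even.
        "median_latency_s": latencies[len(latencies) // 2],
        "max_latency_s": latencies[-1],
    }


def fake_operation() -> None:
    """Hypothetical stand-in for one client operation."""
    time.sleep(0.001)


if __name__ == "__main__":
    print(generate_load(fake_operation, total_operations=200, concurrency=8))
```

A real generator would replace `fake_operation` with code that speaks the actual protocol, and would usually run for a fixed duration rather than a fixed operation count.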

Besides load, we also need to automate the collection of KPIs. This usually means collecting Windows performance counters. The .NET Framework has a PerformanceCounter class that makes collection easy.
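The sketch below does not use the PerformanceCounter class. It is a language-neutral illustration, written in Python, of what KPI collection boils down to: poll a reading at a fixed interval, keep timestamped samples, and summarize them. The `read_value` callable is an assumption; in practice it would wrap whatever counter API the platform provides.

```python
import time
from statistics import mean
from typing import Callable, List, Tuple


def collect_samples(read_value: Callable[[], float],
                    sample_count: int,
                    interval_s: float) -> List[Tuple[float, float]]:
    """Poll `read_value` sample_count times, interval_s apart.

    Returns (seconds_since_start, value) pairs.
    """
    samples = []
    start = time.monotonic()
    for i in range(sample_count):
        samples.append((time.monotonic() - start, read_value()))
        if i < sample_count - 1:
            time.sleep(interval_s)
    return samples


def summarize(samples: List[Tuple[float, float]]) -> dict:
    """Reduce the collected values to min, mean, and max."""
    values = [value for _, value in samples]
    return {"min": min(values), "mean": mean(values), "max": max(values)}


if __name__ == "__main__":
    # Example reading: CPU time consumed by this process so far.
    cpu_samples = collect_samples(time.process_time, sample_count=3, interval_s=0.01)
    print(summarize(cpu_samples))
```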

Once things are automated, the next step is to execute the tests. Sometimes we run the tests manually only once or twice. Other times we schedule the tests to run automatically on a periodic basis. Each approach provides value in different ways and the choice depends on the team’s goals.

Step 4: report results

After tests are executed, we collect, analyze, and report results. We usually create a document that summarizes the major findings of the testing. We publish the document towards the end of the milestone.

Additionally, sometimes results are shared with folks throughout the milestone while testing is taking place. This can happen either manually, or via an automated system. For example, on our current project we are utilizing a web-based performance dashboard that a peer team created. The performance tests publish data to the dashboard automatically at the end of each run.

The Case for Fewer Test Cases


Testers are often encouraged to automate more and more test cases. At first glance, the case for more test cases makes sense—the more tests you have, the better your product is tested. Who can argue with that? I can.

Creating too many test cases leads to the condition known as “test case bloat”. This occurs when you have so many test cases that you spend a disproportionate amount of time executing, investigating, and maintaining these tests. This leaves little time for more important tasks, such as actually finding and resolving product issues. Test case bloat causes the following four problems:

1. Test passes take a long time to complete.

The longer it takes for your test pass to complete, the longer you have to wait before you can begin investigating the failures. I worked on one project where there were so many test cases, the daily test pass took 27 hours to finish. It’s hard to run a test pass every day when it takes more than 24 hours to complete.

2. Failure investigations take a long time to complete.

The more tests you have, the more failures you have to investigate. If your test pass takes a day to complete, and you have a mountain of failures to investigate, it could be two days or longer before a build is validated. This turn-around time may be tolerable if you’re shipping your product on a DVD. But when your software is a service, you may need to validate product changes a lot faster.

For example, the product I’m working on is an email service. If a customer is without email, it’s unacceptable for my team to take this long to validate a bug fix. Executing just the highest-priority tests to validate a hot-fix may be a valid compromise. If you have a lot of test cases, however, even this can take too long.

3. Tests take too much effort to maintain.

When your automation suffers from test case bloat, even subtle changes in product functionality can cause massive ripples in your existing test cases, drastically increasing the amount of time you spend maintaining them. This leaves little time for other, more valuable tasks, such as testing new features. It’s also a morale killer. Most testers I know (the really good ones, at least) don’t want to continually maintain the same test cases. They want to test new features and write new code.

4. After a certain threshold, more test cases no longer uncover product bugs; they mask them.

Most test cases only provide new information the first time they’re run. If the test passes, we can assume the feature works. If the test fails, we file a bug, which is eventually fixed by development, and the test case begins to pass. If it’s written well, the test will continue to pass unless a regression occurs.

Let’s assume we have 25 test cases that happily pass every time they’re run. At 3:00 a.m. an overtired developer then checks in a bug causing three tests to fail. Our pass rate would drop from 100% to an alarming 88%. The failures would be quickly investigated, and the perpetrator would be caught. Perhaps we would playfully mock him and make him wear a silly hat.

But what if we had 50 test cases? Three failures out of 50 test cases is a respectable 94% pass rate. What about a hundred or two hundred tests? With this many tests, it’s now very possible that some failures occur in every pass simply because of test code problems; timing issues are a common culprit. The same three failures in two hundred tests is a 98.5% pass rate, which rounds to 99%. But were these failures caused by expected timing issues, or by a real product bug? If your team were pressed to get a hot-fix out the door to fix a live production issue, it might not investigate a 99% pass rate with as much vigor as an 88% pass rate.
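The arithmetic is easy to check. This short Python sketch computes the pass rate for the same three failures as the suite grows; the function name is just illustrative.

```python
def pass_rate(total_tests: int, failures: int) -> float:
    """Percentage of tests that passed."""
    if total_tests <= 0 or not 0 <= failures <= total_tests:
        raise ValueError("need total_tests > 0 and 0 <= failures <= total_tests")
    return 100.0 * (total_tests - failures) / total_tests


for total in (25, 50, 100, 200):
    print(f"{total:>3} tests, 3 failures: {pass_rate(total, 3):.1f}% pass rate")
# Prints:
#  25 tests, 3 failures: 88.0% pass rate
#  50 tests, 3 failures: 94.0% pass rate
# 100 tests, 3 failures: 97.0% pass rate
# 200 tests, 3 failures: 98.5% pass rate
```

The number of failures never changes, only the denominator does, which is why a raw pass rate is a poor signal once a suite gets large.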

Bloat Relief

If your automation suffers from test case bloat, you may be able to refactor your tests. But you can’t simply mash four or five tests with different validation points into a single test case. The more complicated a test, the more difficult it becomes to determine the cause and severity of failure.

You can, however, combine test cases when your validation points are similar, and the severity of a failure at each validation point is the same. For example, if you’re testing a UI dialog, you don’t need 50 different test cases to validate that 50 objects on the screen are all at their expected location. This can be done in one test.

You can also combine tests when you’re checking a single validation point, such as a database field, with different input combinations. Don’t create 50 different test cases that check the same field for 50 different data combinations. Create a single test case that loops through all combinations, validating the results.
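Here is a minimal Python sketch of that idea. The function under test, `normalize_phone`, and the input cases are hypothetical; the point is only the shape of the test: one test case, one validation point, many inputs, and every failing combination reported rather than just the first.

```python
def normalize_phone(raw: str) -> str:
    """Hypothetical function under test: keep only the digits."""
    return "".join(ch for ch in raw if ch.isdigit())


# (input, expected) pairs for a single validation point.
CASES = [
    ("555-0100", "5550100"),
    ("(555) 0100", "5550100"),
    ("555 0100", "5550100"),
    ("", ""),
]


def run_combined_test(cases) -> list:
    """Check every combination; collect failures instead of stopping early."""
    failures = []
    for raw, expected in cases:
        actual = normalize_phone(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    return failures


if __name__ == "__main__":
    failed = run_combined_test(CASES)
    print("PASS" if not failed else f"FAIL: {failed}")
```

Collecting all failures keeps the investigation cost low: a single failing test still tells you exactly which inputs broke.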

When my test pass was taking 27 hours to complete, one solution we discussed was splitting the pass based on priority, feature, or some other criterion. If we had split it into three separate test passes and run them in parallel, each would have taken only nine hours to finish. But this would have required three times as many servers. That may not be an issue if your test pass runs on a single server or on virtual machines. However, I’ve worked on automation that required more than twenty physical servers, and tripling your server count is not always an option.

In addition to the techniques discussed above, pair-wise testing and equivalence class partitioning are tools that all testers should have in their arsenal. The ideal solution, however, is to prevent bloating before it even starts. When designing your test cases, it’s important to be aware of the number of tests you’re writing. If all else fails, I hear you can gain time by investigating your test failures while travelling at the speed of light.

Performance Test Documents

After reading Andrew’s article on test plans I started thinking about my own experiences with writing test documents. In this article I’ll describe the different types of performance test documents my team creates.

Test Strategy Documents

The first type is high-level test strategy documents. One of my team’s responsibilities is to provide guidance on performance tools, techniques, infrastructure, and goals. We document this guidance and share it with our peer feature teams. These teams use the guidance when planning out specific test cases for their features.

Strategy documents provide value in a number of ways. They help my team gain clarity on strategy, they act as documentation that feature teams can use to learn about performance testing, and they assist us in obtaining sign-off from stakeholders.

I wrote one strategy document this year and it provided all of the above. I still occasionally refer to it when I’m asked about the performance testing strategy for our organization.

Test Plan Documents

Besides creating strategy documents, we also write more traditional test plan documents. These documents define a set of tests that we intend to execute for a project milestone. They include details of the tests, features that will be covered, the expected performance goals, and the hardware and infrastructure that we will use.

Similar to strategy documents, test plans help us gain clarity on a project and act as a springboard for stakeholder review. They seem to have a shorter shelf life though — I don’t find myself reviewing old performance test plans. My approach has been to “write it, review it, and then forget about it.”

Interestingly enough, I do find myself reviewing old test plans authored by other teams. Occasionally we need to write performance tests for a feature that we’re unfamiliar with. The first thing I do is review old functional test plans to understand how the feature works and what the feature team thought were the most important test scenarios. These test plans are invaluable in getting us ramped up quickly.

Result Reports

When my team completes a milestone I like to write a report that details the results and conclusions of the performance testing. These reports contain performance results, information about bugs found, general observations, and anything else I think might be useful to document. I send the final report to stakeholders to help them understand the results of testing.

One thing I really like about these reports is that they help me figure out which types of testing provided the most value. They also help me figure out how we can improve. When I start planning a new milestone, I first go through the old reports to get ideas.

Wrapping Up

Documentation isn’t always fun but I do find that it provides value for me, my team, and the organization. I’d like to pose a question to readers — what types of test documents do you create, and how do they provide value?

Thanks for reading!

- Rob

Death by a Thousand Little Bugs


Minor product defects that take only a few minutes to resolve are often never fixed; it seems there are always more important tasks to work on. If this sounds familiar, your test team may suffer from morale issues. And your product may suffer from “death by a thousand little bugs”. Fortunately, these problems can be fixed as easily as these bugs can.

Once testers get their hands on a feature, it doesn’t take long for low-priority defects to pile up in their bug-tracking database. These may include, for example, minor UI issues such as missing punctuation, inconsistent fonts, or grammar errors. These bugs tend to pile up because they are primarily cosmetic. Testers resolve the highest-priority bugs first–often rightly so. We should fix bugs that greatly affect functionality, performance, or security before fixing a spelling typo in the UI.

What can happen, however, is that we never fix many of these low-priority bugs. There are often more critical defects being discovered, so we continuously postpone the low-priority ones.

Unfortunately, some of the bugs left behind are those that were logged the earliest. There are few things I find more frustrating than reporting a simple bug that doesn’t get fixed. My typical complaint sounds something like this: “Why hasn’t this bug been fixed? I logged it weeks ago. It’s a one-line change that will take only two minutes to fix!”

A previous project I worked on provides a perfect example. Not long after I was given the first working build of the UI, I logged two minor bugs. One issue was logged because two buttons on the same page were not aligned properly. The other bug was simply that a sentence ended with an extra period. When the product was released more than four months later, the misaligned buttons and the extra period were still there.

Another problem is that even if these low-impact bugs don’t affect functionality, they can greatly affect the customer’s perception of the product. How can a customer fully trust a product, no matter how well it actually works, if there are mountains of minor defects? This is the “death by a thousand little bugs” syndrome.

Before I came to Microsoft, I ran an online store. One night I modified the shopping cart page, and the next day sales plummeted. When I reviewed the changes I had made, I realized that I misspelled two words and added a broken image link. I fixed these issues and sales quickly went back to normal.

The functionality of the page hadn’t changed at all. But potential customers saw the “minor” errors and assumed the entire shopping cart had poor quality. They certainly didn’t rationalize, “They must have spent all their effort making sure the functionality was solid. That’s why they postponed these obvious, but low-priority bugs.”

The “death by a thousand little bugs” syndrome exists because most teams evaluate each bug individually. Individually, each of these bugs is trivial, but in the aggregate they are not. Collectively, they make users skeptical of your product.

The solution is that we shouldn’t always address high-priority bugs before low-priority bugs. But when do we make the exceptions? Here are three strategies that I think could help solve these problems.

  1. Set aside one day each month for developers to address the low-priority, low-hanging-fruit bugs. This is a great way to fix a lot of bugs in a short amount of time. It can also prevent your product from suffering from “death by a thousand little bugs.”
  2. Put aside one day every month to fix the defects that have been in the bug database the longest–regardless of priority. This helps prevent testers from becoming demoralized because bugs they logged months ago still haven’t been fixed.
  3. Once a month, increase the priority of all bugs that are at least 30 days old. Developers can continue to pull bugs out of the queue in priority order, but the difference is that after one month, a bug that was logged as P4 (lowest priority) becomes a P3. After three months, it becomes a high-priority P1 bug. It may initially sound odd that low-priority defects, such as a misspelled word in a log file, will eventually be classified as highest priority. But doing so forces some action to be taken on the bug. As a P1, it now must either be fixed or closed by the Program Manager as “Won’t Fix”.
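The third strategy is mechanical enough to sketch in code. The Python below is a hypothetical illustration, not a script for any real bug-tracking system: the `Bug` class, its fields, and the dates are all assumptions. Run once a month, it raises the priority of every bug that is at least 30 days old by one level, stopping at P1.

```python
from dataclasses import dataclass
from datetime import date, timedelta

HIGHEST_PRIORITY = 1            # P1
MIN_AGE = timedelta(days=30)    # only bugs at least this old are escalated


@dataclass
class Bug:
    title: str
    priority: int               # 1 = P1 (highest) ... 4 = P4 (lowest)
    logged_on: date


def monthly_escalation(bugs, today: date) -> list:
    """Raise the priority of old bugs by one level; return the titles changed."""
    escalated = []
    for bug in bugs:
        old_enough = today - bug.logged_on >= MIN_AGE
        if old_enough and bug.priority > HIGHEST_PRIORITY:
            bug.priority -= 1
            escalated.append(bug.title)
    return escalated


if __name__ == "__main__":
    typo = Bug("Misspelled word in log file", priority=4, logged_on=date(2024, 1, 1))
    for run_day in (date(2024, 2, 1), date(2024, 3, 1), date(2024, 4, 1)):
        monthly_escalation([typo], today=run_day)
        print(run_day, f"P{typo.priority}")
```

With these example dates, the P4 bug becomes a P3 after the first monthly run and a P1 after the third, at which point someone has to fix it or close it.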

You may be thinking, “but I’m a tester, and these solutions have nothing to do with testers.” When I started in Test, that’s how I thought. I now realize that my primary responsibility is to make my product better, not just to log bugs. If these strategies would work well for your team, then you should lobby for them–they may even increase your own morale along the way.

Do you think any of these strategies would work well for your team? What strategies have you tried in the past, and how have they worked? I’m very interested in hearing your comments.

Where’s My Bottleneck?

Hello! My name is Rob Tougher and I’ll be contributing content to the Expert Testers blog. I’m excited for the opportunity to write about testing at Microsoft. Many thanks to Andrew for organizing this effort and setting things up.

I lead a team that focuses on performance and reliability testing. We spend a lot of time analyzing and diagnosing performance issues in our test labs and production datacenters. In future blog posts I’ll describe concrete examples of these investigations. Today I’ll outline the general steps we take in order to diagnose a performance issue.

An investigation usually starts with a question like this:

  • “I’m sending mail to an Exchange server and I expect to be able to send at least 200 msgs/s. At the moment the machine will accept only 100 msgs/s. Why is this happening?”

At a high level, our approach is to “measure first, then analyze”. We try not to jump to conclusions and instead make decisions based on data collected by various tools. The following three steps describe the process.

Step 1: reproduce the problem

The first step is to reproduce the problem. It’s helpful to have a simple repro of the situation that we can run as many times as necessary to identify the root cause of the issue. If we can create a repro in the test lab, that’s great and I consider us lucky. Sometimes we don’t have this luxury (it’s too difficult, would take too long, etc.) and need to observe servers directly in one of our production datacenters.

In either case, we try to answer these questions:

  • What are the steps to reproduce?
  • Which servers are involved?
  • What are the software and operating system versions?
  • What are the hardware configurations for the servers?
  • Does the issue occur at the same time every day?
  • Which data is involved?

The goal is to find a simple configuration and set of steps that allow for an easy repro.

Step 2: identify the bottleneck

Now that we (hopefully) have a reproducible test case, the next step is to identify the resource that is the bottleneck. A bottleneck will be either CPU, disk, memory, network, or an operating system entity like locks.

The Microsoft PFE Performance Guide is a great tutorial on finding a resource bottleneck. This guide is authored by Microsoft PFEs who use these steps while diagnosing issues in the field. My team can usually find the bottleneck quickly using Windows Performance Monitor and these techniques.

Step 3: identify the root cause of the bottleneck

This is the tough part. Now that we know which resource is the bottleneck, we need to figure out why. Accomplishing this is different for each type of resource. Here are some guidelines we follow for each resource:

  • CPU
  • Disk
  • Memory
    • Managed code? Use WinDbg/SOS to analyze the process heap
    • Native code? Use DebugDiag (I haven’t tried it yet)
  • Network
    • Use Process Explorer to figure out which process is most chatty on the network
  • Locks
    • Managed code? Use the SOS !syncblk command in WinDbg to analyze lock information

(As I write this I get the feeling that the above might be good blog entries in the future.)

Besides the resource-specific strategies, there are some general things we also try to keep in mind:

  • Get symbols working. Symbols are invaluable.
  • Use the scientific method.
  • Don’t make guesses or jump to conclusions — measure, then analyze.

Wrapping Up

Thanks for reading! If you have any questions let me know.

- Rob Tougher
