Where’s My Bottleneck?

Hello! My name is Rob and I’ll be contributing content to the Expert Testers blog. I’m excited for the opportunity to write about testing at Microsoft. Many thanks to Andrew for organizing this effort and setting things up.

I lead a team that focuses on performance and reliability testing. We spend a lot of time analyzing and diagnosing performance issues in our test labs and production datacenters. In future blog posts I’ll describe concrete examples of these investigations. Today I’ll outline the general steps we take in order to diagnose a performance issue.

An investigation usually starts with a question like this:

  • “I’m sending mail to an Exchange server and I expect to be able to send at least 200 msgs/s. At the moment the machine will accept only 100 msgs/s. Why is this happening?”

At a high level, our approach is to “measure first, then analyze”. We try not to jump to conclusions and instead make decisions based on data collected by various tools. The following three steps describe the process.

Step 1: reproduce the problem

The first step is to reproduce the problem. It’s helpful to have a simple repro of the situation that we can run as many times as necessary to identify the root cause of the issue. If we can create a repro in the test lab, that’s great and I consider us lucky. Sometimes we don’t have this luxury (it’s too difficult, would take too long, etc.) and need to observe servers directly in one of our production datacenters.

In either case, we try to answer these questions:

  • What are the steps to reproduce?
  • Which servers are involved?
  • What are the software and operating system versions?
  • What are the hardware configurations for the servers?
  • Does the issue occur at the same time every day?
  • What data is involved?

The goal is to find a simple configuration and set of steps that allow for an easy repro.
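
To make that concrete, here is roughly what a minimal repro driver for the mail-flow question above could look like. This is only a sketch in Python using the standard smtplib module; the server name, addresses, and message count are placeholders I made up, and a real repro would use whatever client and message mix appeared in the original report.

    import smtplib
    import time
    from email.message import EmailMessage

    # Hypothetical lab values; substitute your own server and mailboxes.
    SERVER = "exchange-lab-01"
    SENDER = "loadtest@contoso.com"
    RECIPIENT = "probe@contoso.com"
    MESSAGE_COUNT = 1000

    # Build one small, fixed-size message and send it repeatedly.
    msg = EmailMessage()
    msg["From"] = SENDER
    msg["To"] = RECIPIENT
    msg["Subject"] = "repro probe"
    msg.set_content("x" * 1024)

    start = time.time()
    with smtplib.SMTP(SERVER, 25) as smtp:
        for _ in range(MESSAGE_COUNT):
            smtp.send_message(msg)
    elapsed = time.time() - start

    print(f"sent {MESSAGE_COUNT} messages in {elapsed:.1f} s "
          f"({MESSAGE_COUNT / elapsed:.1f} msgs/s)")

Running something like this against the lab server a few times tells us whether the throughput ceiling reproduces outside the original environment before we invest in deeper analysis.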

Step 2: identify the bottleneck

Now that we (hopefully) have a reproducible test case, the next step is to identify the resource that is the bottleneck. The bottleneck will typically be CPU, disk, memory, network, or an operating system entity such as locks.

The Microsoft PFE Performance Guide is a great tutorial on finding a resource bottleneck. The guide is authored by Microsoft PFEs who use these steps while diagnosing issues in the field. My team can usually find the bottleneck quickly using Windows Performance Monitor and these techniques.
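
In practice we do this interactively in Performance Monitor, watching counters such as % Processor Time, Available MBytes, Avg. Disk sec/Transfer, and network Bytes Total/sec. To give a flavor of that first pass in script form, here is a small sketch that samples rough analogues of those counters. It uses the third-party psutil package, which is my choice for illustration rather than anything the PFE guide prescribes.

    import time

    import psutil  # third-party package: pip install psutil

    # Sample rough analogues of the first-pass Performance Monitor counters
    # (% Processor Time, Available MBytes, disk and network throughput).
    psutil.cpu_percent()  # prime the CPU counter; the first reading is meaningless
    disk_prev = psutil.disk_io_counters()
    net_prev = psutil.net_io_counters()

    for _ in range(10):
        time.sleep(1)
        cpu = psutil.cpu_percent()
        mem = psutil.virtual_memory()
        disk = psutil.disk_io_counters()
        net = psutil.net_io_counters()

        disk_bytes = (disk.read_bytes + disk.write_bytes) - (disk_prev.read_bytes + disk_prev.write_bytes)
        net_bytes = (net.bytes_sent + net.bytes_recv) - (net_prev.bytes_sent + net_prev.bytes_recv)
        disk_prev, net_prev = disk, net

        print(f"cpu {cpu:5.1f}%  "
              f"avail mem {mem.available // 2**20:6d} MB  "
              f"disk {disk_bytes // 1024:8d} KB/s  "
              f"net {net_bytes // 1024:8d} KB/s")

Whichever resource is pinned (or queued) while the others stay comfortable is usually the one worth digging into.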

Step 3: identify the root cause of the bottleneck

This is the tough part. Now that we know which resource is the bottleneck, we need to figure out why. The approach is different for each type of resource; here are some guidelines we follow:

  • CPU
  • Disk
  • Memory
    • Managed code? Use WinDbg/SOS to analyze the process heap
    • Native code? Use DebugDiag (I haven’t tried it yet)
  • Network
    • Use Process Explorer to figure out which process is most chatty on the network (see the sketch after this list)
  • Locks
    • Managed code? Use the !syncblk command in WinDbg/SOS to analyze lock information

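For the network case above, Process Explorer shows per-process send and receive rates directly. As a rough scripted stand-in (again a psutil sketch of my own, not something Process Explorer exposes), counting open TCP connections per process is often enough to point at the chatty one:

    from collections import Counter

    import psutil  # third-party package: pip install psutil

    # Count open TCP connections per process. Run elevated on Windows so
    # connections owned by other accounts' processes are visible too.
    conn_counts = Counter()
    for conn in psutil.net_connections(kind="tcp"):
        if conn.pid is not None:
            conn_counts[conn.pid] += 1

    for pid, count in conn_counts.most_common(10):
        try:
            name = psutil.Process(pid).name()
        except psutil.NoSuchProcess:
            name = "<exited>"
        print(f"{count:4d} connections  pid {pid:6d}  {name}")

Connection counts are not the same thing as bytes on the wire, so we treat this only as a hint about where to point Process Explorer or a network trace next.
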
(As I write this I get the feeling that each of the above might make a good blog entry in the future.)

Besides the resource-specific strategies, there are some general things we also try to keep in mind:

  • Get symbols working. Symbols are invaluable.
  • Use the scientific method.
  • Don’t make guesses or jump to conclusions — measure, then analyze.

Wrapping Up

Thanks for reading! If you have any questions, let me know.

– Rob T
