Monitoring your service platform – What and how much to monitor and alert?

This is the first in a series of articles related to monitoring of cloud services. I’ve been working as a Software Test Lead on a cloud-based product for a few years, and I’d like to share my experiences. In this first article I’ll give a high-level overview of monitoring and alerts.

SaaS, Software as a Service, is the direction the growing IT industry is moving in. It provides quick, cheap, and easy management and access for storing and retrieving data. More and more corporations are moving their products and solutions to the cloud. What qualities do customers look for when they want their solution in the cloud? Security, reliability, scalability, cost, performance, reporting, yada yada yada, or what? Assuming all these requirements are met (and met very well, with an SLA of golden nines, 99.999%), customers are happy. As long as all of the features promised to the customer work, everyone is a happy camper. But that’s not the reality we testers live in.

Running software as a service is not an easy task. The bigger the architecture of the system you build for your cloud, the more complex it becomes to test and validate. One of the ways we minimize risk in a service platform is to add appropriate alerts that monitor for symptoms requiring action to rectify. So is monitoring the system/platform the ultimate solution to keep your site reliable? Well, I think it’s a combination of monitoring and the quality of the software.

Let’s consider an example hotel reservation system. To be aware of how the system is running in production, I may have many alerts that monitor my web server, disk usage, latency, order processing system, email notification system, etc. These are just a few of the logical components I can imagine, and each component within the system may have multiple alerts. For example, to monitor the health of the web server, here is a sample list of alerts you may choose to have:

Web server Monitoring:

  1. Alert me if CPU utilization is high
  2. Alert me if hard disk failed
  3. Alert me if certificate is expired
  4. Alert me if network resource is not reachable
  5. Alert me if replication is broken
  6. Alert me if latency greater than threshold
  7. Alert me if hard disk is running out of space
  8. Alert me if some unauthorized admin access activity happens
  9. Alert me if software patch on the server failed
  10. Alert me if my primary server failed over
  11. Alert me …there could be many more

It’s great to see that so many alerts can be defined for just this one component (the web server) in my example. Now expand the same concept to your entire service architecture, and you can imagine the number of monitoring points you need to build into the system, with a potential burst in the number of alerts it may generate in a real-time production environment. In general, when these alerts are designed they are classified into one of three general categories: Informational, Warning, and Critical. Depending on the requirement, each alert may result in an action or set of actions to mitigate the risk to the service. If you have only a handful of customers, managing the alerts for the hotel reservation system may be easy. If the scale grows and you have thousands of customers, the complexity increases. An alert could be specific to one logical group of customers, to the entire customer base, or to the entire service. Managing these alerts and resolving them in a timely fashion becomes one of the critical factors for customer satisfaction, and it could increase the COGS [Cost Of Goods Sold] for the service.
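To make the three categories concrete, here is a minimal Python sketch of how a single disk-usage reading might be mapped to a severity. The function name and the 80% and 95% thresholds are assumptions chosen for illustration; a real service would pick its own.

```python
from enum import Enum


class Severity(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    CRITICAL = "critical"


def classify_disk_usage(percent_used, warning_at=80.0, critical_at=95.0):
    """Map a disk-usage percentage to a severity.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    if not 0.0 <= percent_used <= 100.0:
        raise ValueError("percent_used must be between 0 and 100")
    if percent_used >= critical_at:
        return Severity.CRITICAL
    if percent_used >= warning_at:
        return Severity.WARNING
    return Severity.INFORMATIONAL
```

With these assumed thresholds, a reading of 97 is Critical, 85 is a Warning, and 50 is Informational. Keeping the thresholds as parameters makes it easy to tune them per component when an alert turns out to be too noisy.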

Now, as testers it’s our responsibility to validate all those alerts and make sure they work the way they’re supposed to. Wow. It sounded so simple when I wrote that sentence, but in reality it’s a, uh… a tall order. A team trying to achieve the magic-nines SLA will attempt to put alerts here, there, and everywhere to catch any possible issue. When this large number of alerts is combined with the scale your service architecture is designed to work at, it may turn out to be a nightmare to identify legitimate alerts, manage them, and resolve the issues. And over time, whenever something goes wrong and results in a service outage, we tend to add more and more alerts. The bottom line is that the more alerts you put into the system, the greater the chance they end up creating noise, and over time that noise may become overwhelming and be ignored. Ignored noise may hide a legitimate alert, resulting in a disaster. As testers, we should review all alerts, take the time to categorize them into appropriate buckets, and ensure each alert leads to an action that rectifies the problem. Testing these buckets of alerts in the lab is challenging, as is simulating the failure points in the system that trigger the alerts. Once the simulations succeed, the test results give the operations engineering team confidence that they will be able to handle the alerts quickly and effectively.
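One way to start such a review is to keep the alert catalog as data and flag every alert that has no documented action. The sketch below is only an illustration: the `AlertDefinition` type, its fields, and the sample alert names are hypothetical, not part of any real monitoring product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertDefinition:
    name: str
    severity: str          # "informational", "warning", or "critical"
    action: Optional[str]  # what the responder should do; None if undocumented


def alerts_without_action(catalog: List[AlertDefinition]) -> List[str]:
    """Return the names of alerts that have no documented action."""
    return [a.name for a in catalog if not (a.action and a.action.strip())]


# Hypothetical catalog for the web server example.
catalog = [
    AlertDefinition("certificate_expired", "critical", "Renew and redeploy the certificate"),
    AlertDefinition("cpu_high", "warning", None),
    AlertDefinition("patch_failed", "warning", "   "),
]
unactionable = alerts_without_action(catalog)
```

Here `unactionable` contains `cpu_high` and `patch_failed`. Alerts on that list are candidates for either a written action or removal, since an alert nobody can act on is a likely source of noise.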

Alerts are really important to the service. Be thoughtful about which alerts you add. Weigh the number of alerts you want your system to trigger, and keep it balanced so that the quality of the service is maintained. Review all the alerts carefully and ensure that each results in an actionable item to fix the system. Keep alerts as alerts and don’t let them create noise in your system. Control the COGS for your service effectively and make the monitoring and alerting efficient.

Happy monitoring!

System Architecture: Follow The Data

When I’m planning upcoming tasks for performance, scalability, or reliability testing, the first thing I do is learn the architecture of the system I’ll be working on. This helps me figure out the areas of the system that are most likely to fail.

How do I learn system architecture? I follow the data. Data has three states: it’s either at rest, in use, or in motion. Data at rest is stored in a database or on a file system and is infrequently used; data in use is stored in a database or on a file system and is frequently used; and data in motion is being transmitted between systems or stored in physical memory for reading and updating.

Here are some examples:

  • Data In Motion
    • A client application calling a web service.
    • A client mail transfer agent (MTA) sending an email message to a server MTA via the SMTP protocol.
    • One process calling a COM object in another process.
    • An email message being stored in RAM, so it can be scanned for viruses or spam.
    • A log being stored in RAM, so that it can be parsed.
    • Customer data being queried from a database and aggregated for presentation to the user.
  • Data In Use
    • Customer transactions stored in a database.
    • Program logs stored on disk.
  • Data At Rest
    • Archived databases

So how does this help in planning a testing effort? Usually after I learn and document a system architecture, the obvious weak areas identify themselves. Here are some example epiphanies:

  • “There will be 100 clients calling into that server’s web service…I wonder what the performance of that server will be? And I wonder what would happen if the service were unavailable?”
  • “That data is being stored in RAM during the transaction. How big can that data get? Will it exhaust the machine’s physical memory?”
  • “That data in RAM will be processed N times…how much CPU will that transaction take?”
  • “Those logs will be archived to the file share daily. How much data will be produced each day? Does that exceed the size of the file share?”
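Questions like the last one often come down to simple arithmetic that can be done before any test is run. Here is a small Python sketch of the file share estimate; the function name and all of the numbers in the example are made up for illustration.

```python
def days_until_full(share_capacity_gb, used_gb, daily_log_gb):
    """Estimate how many whole days of logs fit in the remaining space."""
    if daily_log_gb <= 0:
        raise ValueError("daily_log_gb must be positive")
    free_gb = share_capacity_gb - used_gb
    if free_gb <= 0:
        return 0
    return int(free_gb // daily_log_gb)


# Hypothetical numbers: a 500 GB share with 200 GB used, 12 GB of logs per day.
remaining_days = days_until_full(500, 200, 12)  # 25
```

If the answer is shorter than the retention period, that is a weak spot worth testing or redesigning before it shows up in production.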

Following the data helps me quickly learn the architecture and plan the testing effort. What things do you do in order to learn system architecture?

The Tester’s Dilemma: To Pad or Not to Pad


My first self-evaluation as a tester included statements like, “I added 50 new tests to this test suite” and “I automated 25 tests in that suite.” I thought the more tests I wrote, the more productive I was. I was wrong, and so are many testers who still feel that way. But it isn’t all our fault.

Early in my career I wrote a ton of tests, each validating one thing and one thing only. The main benefit of this strategy is that if a test failed, I knew exactly what failed with minimal investigation. An unexpected side effect was that it led me to write a lot of tests. And this, I thought, was an accurate indicator of how productive I was.

I later discovered this strategy also had undesirable side effects. I discussed these side effects in an earlier article to much fanfare, so I won’t go into the details again. But the four disadvantages I see are:

  • Test passes take too long to complete
  • Results take too long to investigate
  • Code takes too much effort to maintain
  • Above a certain threshold, additional tests can mask product bugs

After six years in Test, it’s now obvious to me that it’s not necessarily better to write a lot of tests. I would now rather write one test that finds bugs, than a hundred that don’t. I would rather write one really efficient test that validates a complete scenario, than ten crappy ones that each validate only part of a scenario.

Yet even if it’s better to write fewer, more effective tests, not all testers have the incentive to do so. Are you confident your manager will know you’re working hard and doing a good job if you only have a handful of tests to show for your effort? Not all testers are.

I’m at the point in my career where I’m happy to say I have this confidence because my managers are familiar with the quality of my work. Some less experienced testers, however, face a dilemma: It’s better for their product to have fewer, more efficient tests; but it might be better for their career to write more, less efficient ones.

To be fair, never in my career have I been told that I’m doing a good job because I wrote a lot of tests, or, conversely, doing a bad job because I wrote too few. But sometimes the pressure was less direct.

I worked on one project where at the end of the kick-off meeting I was asked how long it would take to design and automate all of my tests. It was the first day of the project and I had no idea how many tests would be needed, so I asked for time to analyze the functional specifications. I was told we needed to quickly make a schedule and I should give my estimate based on fifty tests.

I had two issues with this question. First, why fifty? I’ll assume it was because fifty sounded like a reasonable number of tests that would help put something in the schedule. The schedule might be changed later, but it would be a good estimate to start with. (In hindsight, it wasn’t a very good estimate, as we actually wrote twice that many tests.)

My bigger problem was that this was a loaded question. I was now under pressure, subtle as it might be, to come up with close to fifty tests. What if I had then analyzed the specs and found that I could test the feature with just five efficient tests? Considering I had given an estimate based on fifty, would this have been viewed as really efficient testing, or really superficial testing?

To solve the tester’s dilemma we need to remove any incentive to pad our test count. We can do this by making sure our teams don’t use “test count” as a quality metric. Our goal should be quality, not quantity; and test count is not a good metric for either test quality or tester quality.

Luckily I’ve never worked on a team that used “test count” as a metric, but I know of teams that do. I also know of teams that use a similar metric: “bug count”. One tester I know spent most of his time developing automation, and yet was “dinged” for not logging as many bugs as the manual testers on the team. Much like “test count”, the number of bugs logged is not as important as the quality of those bugs. We should look at any metric that ends in the word “count” with skepticism.

We also need to keep an eye out for the more subtle forms of pressure to pad our test count. For example, hearing any of the following makes me leery:

  • Automate fifty tests.
  • Design ten Build Verification Tests (BVTs).
  • Write a test plan four to five pages long.
  • 10% of your test cases should be classified as P1 (highest-priority).

All of these statements frame the number of tests we’re expected to create. While they’re fine as guidelines, they may also tempt you to add a few extra tests to reach fifty, or classify a couple of P2s as P1. And that can’t be good for your product or your customers.

Video Is Worth A Thousand Words

When attempting to file a bug, some people are not the best at explaining the issue (you know who you are), and time is lost both by the triage team trying to understand it and by the filer answering questions from developers. There is also the risk that the bug will be mismarked as ‘no repro’ or ‘by design’ if it is not well understood.

Therefore, your best friend is a very good set of repro notes that explain how a developer can reproduce the issue in their own environment. It is also highly advisable to attach a picture of the issue to the bug, giving readers a quick and easy way to understand it at a glance. If you simply take a screenshot (PrtScr key), paste it into Paint, and attach that file, you are ahead of the curve… but really, the “Snipping Tool” that ships with Windows 7 and later is way easier to use and lets you annotate the image before saving it.

Yeah, smarty pants… but what about an issue that involves a set of complicated Repro Steps?

If your issue contains a series of steps, or the reactions are hard to describe, what do you do then? A picture of the one event will not suffice. This is when you go to the movies. No, not the new action flick that was a remake of a much better foreign film… I mean MAKE a movie of the bug!

There are several options for making movies of your mouse and screen actions. I’ve tried quite a few, but the one that beats all the others, in my opinion, is unfortunately available only to Microsoft internal employees, so it will not help you here. Still, I’ll detail the features that are essential to me in a screen recorder and let you evaluate some of the options available out there.

Screen Recorder is a product created by a developer here at Microsoft, and meets and exceeds my expectations for a good option in bug reporting. Unfortunately, it is not yet available to the general public, but it does illustrate what a good app for bug reporting should look like.

This app does lots of things correctly:

  • very simple UI (see above)
  • allows you to configure where the output file is written easily
  • outputs a Windows Media file (with some apps you have to do the final encoding yourself)
  • allows for audio recording (configured from the file output menu)
  • allows for pausing and resuming
  • allows for full screen or selection of one running application

This method is perfect for adding a WMV to a bug or presentation to easily portray a bug. This is also great for tutorials, how-to wiki, and PowerPoint presentations.

One publicly available option I’ve tried is “My Screen Recorder”. It has most of the features detailed above but does not have a simple UI; still, I’d say it’s the best option at present. Another was Microsoft’s Expression Encoder, which was very versatile but way too complex for this purpose, and it did not encode the video in the same step as the recording, which was very time consuming.

Lastly, I should mention that Windows 7 ships with its own recording software for reporting bugs to Microsoft. This software, called “Problem Steps Recorder”, can be used to create a detailed HTML page that includes step-by-step screenshots of the repro, which could be an option on a machine where you cannot install additional software. A detailed view of how to use this option is shown in this TechRepublic blog post.

It would be great to see if any one of you has additional options for screen casting software that you think is superior, and why. Please feel free to leave that suggestion in the comment section, so that we all can test it out, pun intended.

Please remember that videos make excellent supplements to a bug, but they do not substitute for good, searchable text describing the issue and its repro steps. The bug should include good setup, initialization, and execution steps for the issue, which most teams would consider mandatory.

For bugs to be most useful, they should be reported containing all investigatory evidence:

  • Expected result/actual result notes
  • Source data/files
  • Exception details
  • Stack trace (if possible)
  • And of course… THE MOVIE!

So let’s go to the movies, people… and make bug resolution more efficient in the process!

Enjoy!

Performance Testing 101

Hi all. In this post I’ll go over the general approach my team uses when planning and executing performance tests in the lab.

Step 1: define the questions we’d like to answer

I view software testing as answering questions. At the beginning of a test effort we document the questions we would like to answer, and then we spend the rest of the milestone answering them.

Questions for performance generally fall into three categories:

  • Resource utilization — how much CPU, disk, memory, and network does a system use?
  • Throughput — how many operations per second can a system handle?
  • Latency — how long does it take one operation to complete?

Here are some examples:

  • How many operations per second can a server handle?
  • How many concurrent clients can a server handle?
  • If a server handles load for 2 weeks straight, does throughput or latency degrade? Do we leak memory?
  • If we keep adding new customer accounts to a system, at what point will the system fall over? Which component will fall over first?
  • When a user opens a file that is 1 GB in size, how long does it take to open? How much disk activity occurs during this process?
  • When a process connects to a remote server, how much network bandwidth is used?
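For a single-machine case, the throughput and latency questions can be sketched in a few lines of Python. This is a rough illustration only: `measure` and `sample_operation` are hypothetical names, and a real test would drive the actual system under load rather than a local function.

```python
import statistics
import time


def measure(operation, iterations=100):
    """Run operation repeatedly and return (ops_per_second, median_latency_seconds)."""
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    latencies = []
    start = time.perf_counter()
    for _ in range(iterations):
        t0 = time.perf_counter()
        operation()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return iterations / elapsed, statistics.median(latencies)


def sample_operation():
    # Stand-in for one real operation, such as a web service call.
    sum(range(10_000))


throughput, median_latency = measure(sample_operation, iterations=200)
```

The median is used here because a few slow outliers can distort an average; reporting a high percentile alongside it is a common refinement.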

We spend a lot of time thinking about the questions, because these questions guide the rest of the process.

Step 2: define the performance tests

The next step is to define the performance tests that will help us answer our questions. For each performance test we identify two things: 1.) expected load and 2.) key performance indicators (KPIs).

Load is the set of operations that are expected to occur in the system. All of these operations compete for resources and affect the throughput and latency. Usually a system will have multiple types of load all occurring at the same time, and thus we try to simulate all of these types of load in a performance test.

A mistake I’ve made in the past is to not identify all types of important load. Sometimes I’ve focused too closely on one type, and forgot that there were other operations in the system that affected performance. The lesson I’ve learned: don’t test in a vacuum.

The second part of a performance test is the key performance indicators (KPIs). These are the variables we want to measure, along with the goals for each variable. We always gather data for system resources (CPU, disk, memory, and network). We also gather data for application-specific KPIs in latency and throughput.
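Pairing each KPI with a goal can be made explicit in code. The Python sketch below assumes each goal is either an upper bound ("max", as for latency or CPU) or a lower bound ("min", as for throughput); the KPI names and numbers are hypothetical.

```python
def evaluate_kpis(measured, goals):
    """Compare measured KPI values against their goals.

    measured: dict of KPI name -> measured number
    goals:    dict of KPI name -> (kind, limit), where kind is "max" or "min"
    Returns a dict of KPI name -> True if the goal was met. A KPI with no
    measurement counts as not met.
    """
    results = {}
    for name, (kind, limit) in goals.items():
        value = measured.get(name)
        if value is None:
            results[name] = False
        elif kind == "max":
            results[name] = value <= limit
        elif kind == "min":
            results[name] = value >= limit
        else:
            raise ValueError(f"unknown goal kind: {kind!r}")
    return results


# Hypothetical goals and one run's measurements.
goals = {
    "cpu_percent": ("max", 70.0),
    "latency_ms": ("max", 200.0),
    "messages_per_sec": ("min", 500.0),
}
measured = {"cpu_percent": 64.2, "latency_ms": 250.0, "messages_per_sec": 510.0}
outcome = evaluate_kpis(measured, goals)
```

In this example `outcome` reports that the CPU and throughput goals were met and the latency goal was missed.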

Step 3: automate and execute the tests

Now that the plans are complete, we focus on automation and execution. For each performance test we automate the load generation and data (KPI) collection.

Load generators are written in C#. With each load generator we attempt to mimic the expected load in the system. For example, if the load is SMTP email messages, we’ll write a generator that implements the client side of an SMTP session. If the load is SQL transactions, we’ll write functions that simulate these transactions.
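To illustrate the idea of generating several types of load at once, here is a minimal sketch in Python (not the C# our generators are written in). The `run_mixed_load` function and the operation names are hypothetical, and the callables are stand-ins for real work such as an SMTP session or a SQL transaction.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_mixed_load(operations, weights, total_calls, workers=4, seed=0):
    """Issue total_calls operations across worker threads, choosing each by weight.

    operations: dict of name -> zero-argument callable (stand-ins for real load)
    weights:    dict of name -> relative frequency of that operation
    Returns a Counter of how many times each operation was issued.
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    names = list(operations)
    plan = rng.choices(names, weights=[weights[n] for n in names], k=total_calls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for completion and re-raises any worker exception.
        list(pool.map(lambda name: operations[name](), plan))
    return Counter(plan)
```

A caller might pass something like `{"send_mail": ..., "sql_txn": ...}` with weights of 3 and 1 to approximate a system where mail traffic is three times as frequent as database transactions. Building the plan up front from a seeded generator keeps the mix reproducible between runs.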

Besides load, we also need to automate the collection of KPIs. This usually means collecting Windows performance counters. The .NET Framework has a PerformanceCounter class that makes collection easy.
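The collection loop itself is simple regardless of where the readings come from. The Python sketch below is a generic stand-in, not the .NET PerformanceCounter API: `read_counter` is a hypothetical callable that returns one numeric reading each time it is called.

```python
import time


def collect_samples(read_counter, samples, interval_sec=1.0):
    """Call read_counter() `samples` times, pausing interval_sec between reads.

    Returns a list of (elapsed_seconds, value) tuples.
    """
    readings = []
    start = time.monotonic()
    for i in range(samples):
        readings.append((time.monotonic() - start, read_counter()))
        if i < samples - 1:
            time.sleep(interval_sec)
    return readings
```

Recording the elapsed time with each value makes it possible to line the samples up with the load that was running at that moment.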

Once things are automated, the next step is to execute the tests. Sometimes we run the tests manually only once or twice. Other times we schedule the tests to run automatically on a periodic basis. Each approach provides value in different ways and the choice depends on the team’s goals.

Step 4: report results

After tests are executed, we collect, analyze, and report results. We usually create a document that summarizes the major findings of the testing. We publish the document towards the end of the milestone.

Additionally, sometimes results are shared with folks throughout the milestone while testing is taking place. This can happen either manually, or via an automated system. For example, on our current project we are utilizing a web-based performance dashboard that a peer team created. The performance tests publish data to the dashboard automatically at the end of each run.
