Monitoring your service platform – What and how much to monitor and alert?

This is the first in a series of articles related to monitoring of cloud services. I’ve been working as a Software Test Lead on a cloud-based product for a few years, and I’d like to share my experiences. In this first article I’ll give a high-level overview of monitoring and alerts.

SaaS, Software as a Service, is the direction the growing IT industry is moving in. It provides quick, cheap, and easy management of and access to stored data. More and more corporations are moving their products and solutions to the cloud. What qualities do customers look for when they want their solution in the cloud? Security, reliability, scalability, cost, performance, reporting, yada yada yada. Assuming all these requirements are met, and met very well with an SLA of five nines (99.999%), customers are happy. As long as all of the features promised to the customer work, everyone is a happy camper. But that’s not the reality we testers live in.

Running software as a service is not an easy task. The bigger the architecture of the system you build for your cloud, the more complex it becomes to test and validate. One of the ways we minimize risk in a service platform is to add appropriate alerts that monitor for symptoms requiring action to rectify. So is monitoring the system/platform the ultimate solution for keeping your site reliable? Well, I think it’s a combination of monitoring and the quality of the software.

Let’s consider an example hotel reservation system. To know how the system is running in production, I may have many alerts that monitor my web server, disk usage, latency, order processing system, email notification system, etc. These are just a few of the logical components I can imagine, and each component within the system may have multiple alerts. For example, to monitor the health of the web server, here is a sample list of alerts you may choose to have:

Web server monitoring:

  1. Alert me if CPU utilization is high
  2. Alert me if a hard disk has failed
  3. Alert me if a certificate has expired
  4. Alert me if a network resource is not reachable
  5. Alert me if replication is broken
  6. Alert me if latency is greater than the threshold
  7. Alert me if a hard disk is running out of space
  8. Alert me if any unauthorized admin access activity happens
  9. Alert me if a software patch on the server failed
  10. Alert me if my primary server failed over
  11. Alert me …there could be many more
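To make a few of the alerts above concrete, here is a minimal Python sketch that treats each alert as a named condition evaluated against a snapshot of metrics. The metric names (`cpu_percent`, `disk_free_percent`, `latency_ms`, `cert_days_remaining`) and every threshold are assumptions made up for illustration; a real system would take them from its own monitoring agent and SLA.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class AlertRule:
    """One alert: a name plus a condition over a metrics snapshot."""
    name: str
    condition: Callable[[Metrics], bool]


# Hypothetical thresholds, chosen only for this example.
RULES: List[AlertRule] = [
    AlertRule("CPU utilization is high", lambda m: m["cpu_percent"] > 90.0),
    AlertRule("Hard disk is running out of space", lambda m: m["disk_free_percent"] < 10.0),
    AlertRule("Latency is greater than the threshold", lambda m: m["latency_ms"] > 500.0),
    AlertRule("Certificate has expired", lambda m: m["cert_days_remaining"] <= 0),
]


def evaluate(metrics: Metrics, rules: List[AlertRule] = RULES) -> List[str]:
    """Return the names of all alerts whose condition holds for this snapshot."""
    return [rule.name for rule in rules if rule.condition(metrics)]


# A made-up snapshot from one web server: only the CPU rule should fire.
sample = {
    "cpu_percent": 97.0,
    "disk_free_percent": 42.0,
    "latency_ms": 120.0,
    "cert_days_remaining": 30.0,
}
print(evaluate(sample))
```

Keeping the rules in one list makes it easy to review how many alerts a single component carries, which matters once the same idea is repeated across the whole service.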

It’s great to see how many alerts can be defined for just this one component (the web server) in the example. Now expand the same concept to the complete spectrum of your service design/architecture, and you can imagine the number of monitoring points and alerts you need to build into the system, along with the potential burst of alerts the system may generate in a real-time production environment. In general, when these alerts are designed, each may be classified into one of three general categories: Informational, Warning, and Critical. Depending on the requirement, each alert may result in an action or set of actions to mitigate the risks to the service.

Given the complexity of the hotel reservation system, if you have only a handful of customers, managing the alerts at this scale may be easy. If the scale grows and you have thousands of customers, the complexity increases. The alerts from the system could be specific to one logical group of customers, to the entire customer base, or to the entire service. Managing these alerts and resolving them in a timely fashion becomes one of the critical factors for customer satisfaction and could increase the COGS [Cost Of Goods Sold] for the service.
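As a rough illustration of the three categories, the Python sketch below classifies a single disk-space reading and maps each category to an action. The thresholds, the function names `classify_disk_free` and `action_for`, and the action descriptions are all hypothetical; they only show the shape of the idea that every category should lead somewhere.

```python
from enum import Enum


class Severity(Enum):
    INFORMATIONAL = 1
    WARNING = 2
    CRITICAL = 3


# Hypothetical mapping from category to the action it should trigger.
ACTIONS = {
    Severity.INFORMATIONAL: "record in the log for later review",
    Severity.WARNING: "open a ticket for the operations team",
    Severity.CRITICAL: "page the on-call engineer",
}


def classify_disk_free(free_percent: float) -> Severity:
    """Classify remaining disk space; the 5% and 20% cutoffs are assumptions."""
    if free_percent < 5.0:
        return Severity.CRITICAL
    if free_percent < 20.0:
        return Severity.WARNING
    return Severity.INFORMATIONAL


def action_for(severity: Severity) -> str:
    """Look up the action tied to a category."""
    return ACTIONS[severity]


for reading in (60.0, 15.0, 3.0):
    severity = classify_disk_free(reading)
    print(f"{reading}% free: {severity.name}, {action_for(severity)}")
```

A category with no entry in `ACTIONS` would raise a `KeyError`, which is a cheap way to enforce that no alert category exists without a defined response.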

Now, as testers, it’s our responsibility to validate all those alerts and make sure they work the way they’re supposed to. Wow. It sounded so simple when I wrote that sentence, but in reality it’s a tall order. A team trying to achieve the magic-nines SLA will attempt to put alerts here, there, and everywhere to catch any possible issue. When this large number of alerts is combined with the scale your service architecture is designed to work at, it may turn out to be a nightmare to identify legitimate alerts, manage them, and resolve the issues. Over time, whenever something goes wrong and results in a service outage, we tend to add more and more alerts. The bottom line is that the more alerts you put into the system, the greater the chance that they create noise, and over time this noise may become overwhelming and be ignored. Ignored noise may hide a legitimate alert, resulting in a disaster. As testers, we should review all alerts, take the time to categorize them into appropriate buckets, and ensure each alert leads to an action that rectifies the problem. Testing these buckets of alerts in the lab is a challenging task, as is simulating failure points in the system to trigger the alerts. Once the simulations succeed, the test results give the operations engineering team confidence that they will be able to handle and manage the alerts quickly and effectively.
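Simulating a failure point can start very small. The Python sketch below injects a failing probe into a check and confirms that the expected alert is raised, and that a healthy probe raises nothing. The names `AlertSink`, `check_network_resource`, and the probe functions are invented for this example and do not refer to any particular monitoring product.

```python
from typing import Callable, List


class AlertSink:
    """Collects raised alerts so a test can inspect them."""

    def __init__(self) -> None:
        self.raised: List[str] = []

    def raise_alert(self, name: str) -> None:
        self.raised.append(name)


def check_network_resource(probe: Callable[[], bool], sink: AlertSink) -> None:
    """Run the probe and raise an alert if the resource is unreachable."""
    try:
        reachable = probe()
    except OSError:
        # Connection errors are subclasses of OSError; treat them as unreachable.
        reachable = False
    if not reachable:
        sink.raise_alert("Network resource is not reachable")


def healthy_probe() -> bool:
    return True


def failing_probe() -> bool:
    # Simulated failure point: behaves like a dropped connection.
    raise ConnectionError("simulated outage")


def run_simulation(probe: Callable[[], bool]) -> List[str]:
    """Run one check against the given probe and return the alerts it raised."""
    sink = AlertSink()
    check_network_resource(probe, sink)
    return sink.raised


print(run_simulation(healthy_probe))
print(run_simulation(failing_probe))
```

Testing both directions matters: an alert that fires on a healthy system is exactly the kind of noise that eventually gets ignored.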

Alerts are really important to the service. Be thoughtful about which alerts you add. Weigh the number of alerts you want your system to trigger, and keep it balanced so that the quality of the service is maintained. Review all the alerts carefully and ensure that each results in an actionable item to fix the system. Keep alerts as alerts, and don’t let them become noise in your system. Control the COGS for your service effectively, and make monitoring and alerting efficient.

Happy monitoring!

2 Responses

  1. […] in February, in his blog titled “Monitoring your service platform – What and how much to monitor and alert?“,  Prakash discussed the monitoring of a service running in the cloud, the multitude of […]

  2. Thanks Prakash, I really like the post!
