Test In Production – What is it all about?

I would like to share my thoughts about test-in-production (a.k.a. TiP). This term has become a buzzword in the testers' wonderland as the industry moves more toward providing solutions in the cloud. Here are four easy questions through which I plan to address it.

I like to explain this with an analogy: box products in the olden days vs. cloud services today. In earlier days, when software shipped as a box product or a downloadable executable, testing was much simpler, in a way. Those box products had well-defined system requirements: operating system (type, version), supported locales, disk space, RAM, yada yada yada. So when testers defined the test plan, it was self-contained within the boundaries defined by the product. When end users bought the box product, it was their own decision which hardware to install the software on. It was a balanced equation, I guess: what's tested matches what's installed, and if the end user chooses hardware meeting the system requirements, the software works as expected.

With the evolution of today's cloud-oriented solutions, customers want solutions that optimize cost (which is one of the reasons the cloud is evolving, in my opinion). The company providing the software service decides on the hardware to suit its scale and performance needs. In reality, not all software is custom-made for its hardware, so there are many hardware-related variables when it comes to testing software services in the cloud. For example, when you host a solution used by hundreds of thousands of users, you can expect tens or hundreds of servers in the data center.

The small piece of software once tested on one machine or a few machines (depending on the architecture you are testing) now becomes a huge network bound to various Service Level Agreements (SLAs) covering performance, latency, scale, fault tolerance, security, blah blah blah. Although it is very much possible to simulate a data-center-like setup within your corporate environment, there will be plenty of differences from the actual setup in the data center. Some of these may include, but are not limited to, load balancers, Active Directory credentials, different security policies applied to the hosts, domain controller configurations specific to your hosting setup, and storage access credentials; and these are just the tip of the iceberg.

What

So what is TiP? My definition: TiP is an end-to-end customer scenario test that you author with the required input parameters and run continuously, at a predefined interval, against the software endpoints of the hosted service. It validates functionality and component integration, and provides a binary result: Pass or Fail. There are at least two different types of TiP tests you can author: Outside-In (OI) and Inside-Out (IO).

Outside-In (OI): These tests run from outside your production environment, targeting your software endpoints (see the sketch after these definitions).

Inside-Out (IO): These tests run from within your data center, targeting the different roles* you may have, to ensure they are all functioning properly.
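As a minimal sketch of an Outside-In test, assuming a hypothetical endpoint URL, response payload, and interval (none of these come from a specific service), an OI probe can be as simple as a loop that exercises the endpoint and reports Pass or Fail:

```python
import time
import urllib.request

# Hypothetical values: replace with your hosted service's real endpoint
# and an interval that matches your SLA requirements.
ENDPOINT = "https://myservice.example.com/api/health"
INTERVAL_SECONDS = 300  # run the scenario every 5 minutes

def outside_in_probe() -> bool:
    """One end-to-end pass against the live endpoint: binary Pass/Fail."""
    try:
        with urllib.request.urlopen(ENDPOINT, timeout=30) as resp:
            body = resp.read().decode("utf-8")
            # Validate functionality, not just reachability; the expected
            # payload here is an assumption for illustration.
            return resp.status == 200 and '"status": "ok"' in body
    except OSError:
        return False

if __name__ == "__main__":
    while True:
        result = "Pass" if outside_in_probe() else "Fail"
        print(f"{time.ctime()}: {result}")
        # A Fail here should feed the monitoring/escalation pipeline
        # discussed in the Why section below.
        time.sleep(INTERVAL_SECONDS)
```

An Inside-Out test would look much the same, except it runs on a host inside the data center and targets internal role endpoints instead of the public one.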

Why

TiP enables you to proactively find issues before you hear about them from a customer. Since the tests run against your live site, appropriate monitoring should be built into the architecture so that failures from these critical tests are escalated accordingly and appropriate action is taken. TiP is a valuable asset for validating your deployment and the plumbing between the different software roles* you may have in your architecture. TiP plays a critical role during service deployment or upgrade, as it runs end-to-end tests on the production systems before they go live to take real-world traffic. Automated TiP scenario tests may also save testers a lot of manual validation of functionality in the production system.

When

TiP should run all the time, for as long as you keep your software service alive.

How

I'm not going to go into design details of the how; rather, here is a high-level thought. Identify from your test plan a few critical test paths that cover both happy-path and negative test cases. Give priority to the test cases that cover the most code paths and components. For example, if your service involves replication, SQL transactions, a flush policy, etc., encapsulate all of this into a single test case and try to automate that complex path. This will help ensure that the whole pipeline in your architecture is behaving as expected. There are no right or wrong tools for this. From batch files and shell scripts to C# and Ruby on Rails, it's up to you to find the tool set and language appropriate for the task. A rough sketch of such a combined test case follows.
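To make the "encapsulate the complex path into a single test case" idea concrete, here is a minimal sketch in Python. The three helper functions are hypothetical placeholders, not a real API; each would be replaced with calls into your own service:

```python
import uuid

def create_order(item: str) -> str:
    # Placeholder: POST to the service's order API and return the new order ID.
    return str(uuid.uuid4())

def order_replicated(order_id: str) -> bool:
    # Placeholder: query the replica store to confirm the order arrived,
    # which exercises replication and the flush policy along the way.
    return True

def notification_sent(order_id: str) -> bool:
    # Placeholder: confirm the downstream notification pipeline fired.
    return True

def end_to_end_scenario() -> bool:
    """One test case covering the write path, replication, and notification."""
    order_id = create_order("standard-room")
    return order_replicated(order_id) and notification_sent(order_id)

if __name__ == "__main__":
    print("Pass" if end_to_end_scenario() else "Fail")
```

A single Pass from this test tells you that every component in that pipeline is functioning and integrated, which is more signal than three isolated component tests would give you.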

*role – An installation or instance of the operating system serving a specific capability. For example, an authentication system could be one OS instance in your deployment whose sole function is to authenticate all traffic accessing your service.

Monitoring your service platform – What and how much to monitor and alert?

This is the first in a series of articles related to monitoring of cloud services. I’ve been working as a Software Test Lead on a cloud-based product for a few years, and I’d like to share my experiences. In this first article I’ll give a high-level overview of monitoring and alerts.

SaaS, Software as a Service, is the direction the growing IT industry is moving toward. It provides quick, cheap, and easy management and access for storing and retrieving data. More and more corporations are making the shift to move their products and solutions to the cloud. What qualities do customers look for when they want their solution in the cloud? Security, reliability, scalability, cost, performance, reporting, yada yada yada, or what? Assuming all these requirements are met, and met very well with an SLA of the golden nines (99.999%), customers are happy. As long as all of the features promised to the customer work, everyone is a happy camper. But that's not the reality we testers live in.
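To put that five-nines figure in perspective: 99.999% availability leaves an error budget of only 0.001% of the year, i.e. 525,600 minutes × 0.00001 ≈ 5.3 minutes of total downtime per year. That is the bar monitoring has to help you clear.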

Running software as a service is not an easy task. The bigger the architecture of the system you build for your cloud, the more complex it becomes to test and validate. One of the ways we minimize the risks in software as a service, or in the services platform, is to add appropriate alerts that monitor for symptoms requiring action to rectify. So is monitoring the system/platform the ultimate solution to keep your site reliable? Well, I think it's a combination of monitoring plus the quality of the software.

Let's consider an example hotel reservation system. To be aware of how the system is running in production, I may have many alerts that monitor my web server, disk usage, latency, order processing system, email notification system, etc. These are only a few of the logical components I can imagine, and each component within the system may have multiple alerts. For example, to monitor the health of the web server, here is a sample list of alerts you may choose to have:

Web Server Monitoring:

  1. Alert me if CPU utilization is high
  2. Alert me if a hard disk has failed
  3. Alert me if a certificate has expired
  4. Alert me if a network resource is not reachable
  5. Alert me if replication is broken
  6. Alert me if latency is greater than a threshold
  7. Alert me if a hard disk is running out of space
  8. Alert me if unauthorized admin access activity happens
  9. Alert me if a software patch on the server failed
  10. Alert me if my primary server failed over
  11. Alert me if … there could be many more

That is a lot of alerts for just this one component (the web server) in my example. Now expand the same concept to the complete spectrum of your service design/architecture, and you can imagine the number of monitoring points and alerts you need to build into the system, along with the potential burst in the number of alerts the system may generate in a real-time production environment. In general, alerts are classified into three general categories: Informational, Warning, and Critical. Depending on the requirement, each alert may result in an action or set of actions to mitigate the risks to the service. Given the complexity of the hotel reservation system, if you have only a handful of customers, managing the alerts at that scale may be easy. If the scale grows and you have thousands of customers, the complexity increases. An alert from the system could be specific to one logical customer group, to the entire customer base, or to the entire service. Managing these alerts and resolving them in a timely fashion becomes one of the critical factors for customer satisfaction, and the effort involved can increase the COGS [Cost Of Goods Sold] for the service.
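As a rough sketch of how such alerts and severity categories might be represented, here is a minimal example; the rule names, thresholds, and actions are hypothetical and not drawn from any particular monitoring product:

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    INFORMATIONAL = 1
    WARNING = 2
    CRITICAL = 3

@dataclass
class AlertRule:
    name: str            # metric this rule watches
    severity: Severity   # Informational, Warning, or Critical
    threshold: float     # trip point for the metric
    action: str          # what the responder should do

# Hypothetical rules for the web server component of the reservation system.
WEB_SERVER_RULES = [
    AlertRule("cpu_utilization_pct", Severity.WARNING, 85.0,
              "Investigate load; consider scaling out"),
    AlertRule("disk_free_pct", Severity.CRITICAL, 10.0,
              "Free space or add storage before the host fails"),
    AlertRule("request_latency_ms", Severity.WARNING, 500.0,
              "Check downstream dependencies and the load balancer"),
]

def tripped(rule: AlertRule, value: float) -> bool:
    """Return True if the metric value trips the rule's threshold."""
    # Free-space style metrics alert when the value drops BELOW the
    # threshold; everything else alerts when it rises above.
    if rule.name.endswith("_free_pct"):
        return value < rule.threshold
    return value > rule.threshold
```

Note that every rule carries an action: an alert nobody can act on is pure noise, which is exactly the problem discussed next.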

Now, as testers, it's our responsibility to validate all those alerts and make sure they work the way they're supposed to. Wow. It sounds simple when written as one sentence; in reality it's a tall order. A team trying to achieve the magic-nines SLA will attempt to put alerts here, there, and everywhere to catch every possible issue. When this large number of alerts is combined with the scale your service architecture is designed for, it may turn out to be a nightmare to identify legitimate alerts, manage them, and resolve the issues. Over time, whenever something goes wrong with the system and results in a service outage, we tend to add more and more alerts. The bottom line is that the more alerts you put into the system, the greater the chance they create noise, and over time that noise may become overwhelming and get ignored. Ignored alerts may hide a legitimate alert, resulting in disaster. As testers, we should review all alerts, take the time to categorize them into appropriate buckets, and ensure each alert leads to an action that rectifies the problem. Testing these buckets of alerts in the lab is a challenging task, as is simulating the failure points in the system to trigger the alerts. Once the simulations succeed, the test results will give the operations engineering team confidence that they can handle and manage the alerts quickly and effectively.
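Here is a minimal sketch of what validating a single alert through fault injection might look like; fill_disk_to() and query_active_alerts() are hypothetical hooks into your own lab environment and monitoring system, not a real API:

```python
import time

def fill_disk_to(free_pct: float) -> None:
    # Placeholder: write junk data on the lab host until only
    # `free_pct` percent of the disk remains free.
    pass

def query_active_alerts() -> list:
    # Placeholder: ask the monitoring system which alerts are firing.
    return ["disk_free_pct"]

def test_low_disk_alert_fires() -> bool:
    """Drive the system into the failure state and assert the alert fires."""
    fill_disk_to(free_pct=5.0)   # below the hypothetical 10% threshold
    time.sleep(60)               # allow one monitoring cycle to elapse
    return "disk_free_pct" in query_active_alerts()

if __name__ == "__main__":
    print("Pass" if test_low_disk_alert_fires() else "Fail")
```

Repeating this for every alert bucket is what makes the job a tall order, and it is also what separates alerts that have been proven to fire from alerts that merely exist.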

Alerts are really important to the service, so be thoughtful about which alerts you add. Weigh the number of alerts you want your system to trigger and keep it balanced so that the quality of the service is maintained. Review all the alerts carefully and ensure that each one results in an actionable item to fix the system. Keep alerts as alerts, and don't let them become noise in your system. Control the COGS of your service effectively and make the monitoring and alerting efficient.

Happy monitoring!
