Test In Production – What is it all about?

I would like to share my thoughts about test in production (a.k.a. TiP). This term has become a buzzword in the testers' wonderland as the industry moves more toward providing solutions in the cloud. Here are four easy questions through which I plan to address it.

I like to explain this with an analogy between the boxed products of the olden days and the cloud services of today. In earlier days, when software shipped as a boxed product or a downloadable executable, testing was much simpler, in a way. Those boxed products had well-defined system requirements: operating system (type, version), supported locales, disk space, RAM, yada yada yada. So when testers defined the test plan, it was self-contained within the boundaries defined by the product. When end-users bought the boxed product, it was their own decision which hardware to install the software on. It was a balanced equation, I guess: what's tested matches what's installed, and it works as expected = success, as long as the end-user chooses hardware that meets the system requirements.

With the evolution of today's cloud-oriented solutions, customers want solutions that optimize cost (which is one of the reasons the cloud is evolving, in my opinion). The company providing the software service decides on the hardware to suit its scale and performance needs. In reality, not all software is custom-made for specific hardware, so there are many hardware-related variables when it comes to testing software services in the cloud. For example, when you host a solution used by hundreds of thousands of users, you can expect tens or hundreds of servers in the data center.

The small piece of software once tested on one machine, or on multiple machines (depending on the software architecture you are testing), now becomes a huge network tied to various Service Level Agreements (SLAs) covering performance, latency, scale, fault tolerance, security, blah blah blah. Although it is very much possible to simulate a data-center-like setup within your corporate environment, there may (and likely will) be a lot of differences from the actual setup in the data center. Some of these may include, but are not limited to, load balancers, Active Directory credentials, different security policies applied on the hosts, domain controller configurations specific to your hosting setup, and storage access credentials; and these are just the tip of the iceberg.

What

So what is TiP? My definition of TiP is an end-to-end customer test scenario that you author with the required input parameters and run continuously, at a predefined interval, against the software endpoints of the hosted service. It validates the functionality and component integration, and provides a binary result: Pass or Fail. There are at least two different types of TiP tests you can author: Outside-In (OI) and Inside-Out (IO).

Outside-In (OI): These tests run from outside your production environment, targeting your software endpoints.

Inside-Out (IO): These tests run from within your data center, targeting the different roles you may have, to ensure they are all functioning properly.
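To make this concrete, here is a minimal sketch in Python of an outside-in style TiP probe. The endpoint URL, the interval, and the "HTTP 200 means Pass" rule are all hypothetical placeholders, not anything prescribed above; a real probe would exercise a full customer scenario and feed its result into your monitoring system rather than print to the console.

```python
import time
import urllib.request

# Hypothetical endpoint and interval -- replace with your own service values.
ENDPOINT = "https://example.com/api/health"
INTERVAL_SECONDS = 300  # run the scenario every five minutes

def run_scenario() -> bool:
    """One end-to-end customer scenario; returns a binary Pass/Fail."""
    try:
        with urllib.request.urlopen(ENDPOINT, timeout=30) as response:
            # Pass only if the endpoint answers with HTTP 200.
            return response.status == 200
    except OSError:  # urllib.error.URLError and timeouts are OSError subclasses
        return False

if __name__ == "__main__":
    while True:
        result = "Pass" if run_scenario() else "Fail"
        print(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} TiP scenario: {result}")
        time.sleep(INTERVAL_SECONDS)
```

An inside-out (IO) probe could look much the same; the difference is that it runs on a machine inside the data center and targets the internal role endpoints instead of the public one.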

Why

TiP enables you to proactively find issues before you hear about them from a customer. Since the tests run against your live site, appropriate monitoring is expected to be built into the architecture so that failures from these critical tests are escalated accordingly and appropriate action is taken. TiP is a valuable asset for validating your deployment and any plumbing between the different software roles* you may have in your architecture. TiP plays a critical role during service deployment or upgrade, as it runs end-to-end tests on the production systems before they go live to take real-world traffic. Automated TiP scenario tests can also save testers a lot of manual validation of functionality in the production system.
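The escalation path matters as much as the test itself. Below is a hedged sketch of how a failing TiP result might be reported; the webhook URL, the payload shape, and the scenario name are assumptions for illustration, not a reference to any particular monitoring product.

```python
import json
import urllib.request

# Hypothetical alerting webhook -- in practice this would be whatever feeds
# your monitoring or on-call system (pager, incident tracker, and so on).
ALERT_WEBHOOK = "https://alerts.example.com/hooks/tip-failures"

def escalate(scenario: str, details: str) -> None:
    """Report a failed TiP scenario so it gets triaged like a live-site issue."""
    payload = json.dumps({"scenario": scenario, "details": details}).encode("utf-8")
    request = urllib.request.Request(
        ALERT_WEBHOOK,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(request, timeout=30)

# Example wiring: escalate whenever the probe from the earlier sketch fails.
# if not run_scenario():
#     escalate("storefront-happy-path", "OI probe against the public endpoint failed")
```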

When

TiP is recommended to run all the time, for as long as you keep your software service alive.

How

I’m not going to go into design details here; this is a high-level thought. Identify from your test plan a few critical test paths that cover both happy-path and negative test cases. Give priority to the test cases that cover the most code paths and components. For example, if your service has replication, SQL transactions, a flush policy, etc., encapsulate all of this into a single test case and automate that complex path. This will help ensure that the whole pipeline in your architecture is behaving as expected. There are no right or wrong tools for this. From batch files and shell scripts to C# and Ruby on Rails, it’s up to you to find the tool set and language appropriate for the task.
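As a rough illustration of such a single test case, here is a sketch in Python. The helpers write_record, read_from_replica, and force_flush are hypothetical stand-ins for your service's real APIs, and the 60-second replication window is an assumed SLA; the point is simply that one automated scenario exercises the write transaction, the replication, and the flush policy together.

```python
import time
import uuid

def test_write_replicate_flush(write_record, read_from_replica, force_flush):
    """One scenario covering the write transaction, replication, and flush path."""
    record_id = str(uuid.uuid4())

    # 1. Exercise the SQL transaction path on the primary.
    write_record(record_id, {"source": "tip-scenario"})

    # 2. Wait (up to a hypothetical 60-second replication SLA) for the record
    #    to appear on a replica.
    deadline = time.time() + 60
    while time.time() < deadline:
        if read_from_replica(record_id) is not None:
            break
        time.sleep(5)
    else:
        return False  # Fail: the record never replicated in time.

    # 3. Trigger the flush policy and confirm the record is still readable.
    force_flush()
    return read_from_replica(record_id) is not None  # Pass/Fail
```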

*role – An installation or instance of the operating system serving a specific capability. For example, an authentication system could be one OS instance in your deployment whose sole function is to authenticate all traffic accessing your service.

When Can Testing End?

Last year a program manager sent me an email I still think about. The product we were working on had just been released to our dogfood environment, where it would be used by employees before being released to customers.

The next day we were still completing the analysis of our automated test failures. Our initial investigation determined that these failures shouldn’t hold up the release; they either didn’t meet the bug bar, or we thought they were caused by configuration issues in our test environment. But as I continued investigating, I discovered one of the failures actually was caused by a product bug that met the bar. Later that day, I found a second product bug. Shortly after logging the second bug I received this email.

When is testing supposed to end?

My instinct was that testing should never end. At the time, our product had only been released to dogfood, not production. There was no reason to stop testing since we could still fix critical bugs before they affected customers. And even if a bug was found after shipping to customers, it’s often possible to release an update to fix the problem.

But the more I thought about it, the more I realized it’s a perfectly legitimate question. Is there ever a time when testing can stop? If you’re shipping a “shrink-wrapped” product, then the answer is “yes”. If your product is a service, the answer is more complicated.

The minimum conditions required to stop testing are:

  1. Development for the feature is complete (no more code churn).
  2. The release criteria for the feature have been met.
  3. All failing tests and code coverage gaps have been investigated.

The first condition is that development has stopped and the code base is stable. As long as product code is changing, there’s a possibility new bugs are being introduced. The new code changes should be tested and regression tests executed.

Once the product code is stable, the release criteria must be met. The release criteria are typically agreed upon early in the test planning stage by both the Test and Dev teams. They can include conditions such as the following (a simple gate check over these numbers is sketched after the list):

  • No open P1 (severe) bugs
  • 100% of tests executed
  • 95% test pass rate
  • 90% code coverage
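For illustration, criteria like these can be turned into a simple automated gate. The sketch below just mirrors the example thresholds above; the input values are placeholders you would pull from your bug tracker, test results, and coverage reports.

```python
def release_gate(open_p1_bugs: int,
                 tests_executed_pct: float,
                 test_pass_rate_pct: float,
                 code_coverage_pct: float) -> bool:
    """Return True only when every example release criterion above is satisfied."""
    return (open_p1_bugs == 0
            and tests_executed_pct >= 100.0
            and test_pass_rate_pct >= 95.0
            and code_coverage_pct >= 90.0)

# Placeholder inputs -- in practice these come from your bug tracker,
# test run results, and coverage reports.
print(release_gate(open_p1_bugs=0,
                   tests_executed_pct=100.0,
                   test_pass_rate_pct=96.5,
                   code_coverage_pct=91.2))  # True: all criteria met
```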

You might think that once development is complete and your release criteria are met, it's safe to stop testing; this is not the case. In the example above, the release criteria included a 95% automation pass rate. If your actual pass rate, however, is anything less than 100%, it's imperative that you investigate all failures. If they were caused by a bug that meets the bar, then the bug fix will violate our first condition that the code base remain stable, so testing must continue.

The same holds true for code coverage goals. Unless you have perfect code coverage, you need to address any gaps. This can be done either by creating new test cases, or by accepting the risk that these specific areas of code remain untested.

Once our three conditions are met, you can stop testing–if your product will be shrink-wrapped and shipped on a DVD. This doesn’t mean your product is bug-free; all relatively complex software has some bugs. But that’s why you created your release criteria—to manage the risk inherent in shipping software, not to eliminate it. If your minimum test pass rate or code coverage rate is anything less than 100%, you’re choosing to accept some reasonable amount of risk.

If your product is a service then your job isn’t done yet. You don’t need to continue running existing automation in test environments because the results will always be the same. But even though your product code isn’t changing, there are other factors that can affect your service. A password can expire, or a wrong value could be entered in a configuration file. There could be hardware failures or hacking attempts. The load may be more than you planned for. So even though traditional testing may have ended, you should now shift your efforts to monitoring the production environment and testing in production.
