Test In Production – What is it all about?

I would like to share my thoughts about test-in-production (a.k.a. TiP). This term has become a buzzword in the testers' wonderland as the industry moves toward providing solutions in the cloud. Here are four easy questions through which I plan to address it.

I like to explain this with an analogy: the boxed product of the olden days versus the cloud services of today. In earlier days, when software shipped as a boxed product or a downloadable executable, testing was much simpler, in a way. Boxed products have well-defined system requirements such as operating system (type, version), supported locale, disk space, RAM, yada yada yada. So when testers define the test plan, it is self-contained within the boundaries defined by the product. When end users buy the boxed product, it is their own decision which hardware to install the software on. It is a balanced equation, I guess: what is tested matches what is installed, and the software works as expected as long as the end user chooses hardware that meets the system requirements.

With the evolution of today’s cloud-oriented solutions, customers want solutions that optimize cost (which is one of the reasons the cloud is evolving, in my opinion). The company providing the software service decides on the hardware to suit its scale and performance needs. In reality, not all software is custom-made for a particular piece of hardware, so many hardware-related variables come into play when testing software services in the cloud. For example, when you host a solution that is used by hundreds of thousands of users, you can think of tens of hundreds of servers in the data center.

The small software once tested on one machine or a few machines (depending on the software architecture you are testing) now becomes a huge network tied to Service Level Agreements (SLAs) covering performance, latency, scale, fault tolerance, security, blah blah blah. Although it is very much possible to simulate a data-center-like setup within your corporate environment, there will likely be many differences from the actual setup in the data center. These may include, but are not limited to, load balancers, Active Directory credentials, different security policies applied on the hosts, domain controller configurations specific to your hosting setup, and storage access credentials; and these are just the tip of the iceberg.


So what is TiP? My definition of TiP is an end-to-end customer test scenario that you author with the required input parameters and schedule to run continuously, at a predefined interval, against the software endpoints of the hosted service. It validates functionality and component integration, and provides a binary result: Pass or Fail. There are at least two types of TiP tests you can author: Outside-In (OI) and Inside-Out (IO).

Outside-In (OI): These tests run from outside your production environment, targeting your software endpoints.

Inside-Out (IO): These tests run from within your data center, targeting the different roles you may have, to ensure they are all functioning properly.
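To make the definition a little more concrete, here is a minimal Python sketch of what a TiP scenario and its runner might look like. Everything in it is my own assumption for illustration: the names `Scenario`, `run_scenario`, and `run_on_interval` are hypothetical, and the `check` callable stands in for whatever real request you would send to your endpoint or role.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Sequence


@dataclass
class Scenario:
    """One end-to-end customer scenario (hypothetical structure)."""
    name: str
    kind: str                                # "OI" (outside-in) or "IO" (inside-out)
    check: Callable[[Dict[str, str]], bool]  # returns True when the service behaved as expected
    params: Dict[str, str] = field(default_factory=dict)  # required input parameters


def run_scenario(scenario: Scenario) -> str:
    """Run one scenario and reduce the outcome to a binary result."""
    try:
        return "Pass" if scenario.check(scenario.params) else "Fail"
    except Exception:
        # An unexpected error (timeout, bad credentials, ...) also counts as Fail.
        return "Fail"


def run_on_interval(
    scenarios: Sequence[Scenario],
    interval_seconds: float,
    report: Callable[[str, str, str], None],
    sleep: Callable[[float], None] = time.sleep,
    max_rounds: Optional[int] = None,
) -> None:
    """Run every scenario, report each result, wait, and repeat.

    With max_rounds=None the loop never ends, which matches running for as
    long as the service is alive. max_rounds and sleep are injectable so the
    loop can be exercised without waiting.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for scenario in scenarios:
            report(scenario.name, scenario.kind, run_scenario(scenario))
        rounds += 1
        sleep(interval_seconds)
```

A call such as `run_on_interval(scenarios, 300, print)` would run the scenarios every five minutes and print each result. In a real setup, `report` would feed your monitoring system instead of `print`.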


TiP enables you to find issues proactively, before you hear about them from a customer. Since the tests run against your live site, appropriate monitoring should be built into the architecture so that failures from these critical tests are escalated and appropriate action is taken. TiP is a valuable asset for validating your deployment and any plumbing between the different software roles* you may have in your architecture. TiP plays a critical role during service deployment or upgrade, as it runs end-to-end tests on the production systems before they go live to take real-world traffic. Automated TiP scenario tests can save testers a lot of manual validation of functionality in the production system.
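As a rough illustration of the escalation and go-live ideas, here is a small Python sketch. It is only one possible shape, and it is based on my own assumptions: the names `FailureEscalator` and `ready_for_traffic` are hypothetical, and escalating after a fixed number of consecutive failures is a policy I picked for the example. The `escalate` callback stands in for whatever paging or ticketing hook your monitoring provides.

```python
from collections import defaultdict
from typing import Callable, Dict


class FailureEscalator:
    """Escalate when a scenario fails several times in a row (assumed policy)."""

    def __init__(self, threshold: int, escalate: Callable[[str, int], None]):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.escalate = escalate  # hypothetical hook into paging or ticketing
        self._streaks: Dict[str, int] = defaultdict(int)

    def record(self, scenario_name: str, result: str) -> None:
        """Record one Pass/Fail result for a scenario."""
        if result == "Pass":
            # A pass ends the failure streak.
            self._streaks[scenario_name] = 0
            return
        self._streaks[scenario_name] += 1
        # Escalate once, at the moment the streak reaches the threshold,
        # so a long outage does not raise the same alert on every run.
        if self._streaks[scenario_name] == self.threshold:
            self.escalate(scenario_name, self._streaks[scenario_name])


def ready_for_traffic(results: Dict[str, str]) -> bool:
    """Deployment gate: go live only if at least one test ran and all passed."""
    return bool(results) and all(r == "Pass" for r in results.values())
```

For example, `FailureEscalator(3, page_on_call)` would call the (hypothetical) `page_on_call` function the third time in a row that a scenario fails, and `ready_for_traffic({"login": "Pass", "upload": "Fail"})` would hold back an upgrade.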


TiP should run all the time, for as long as you keep your software service alive.


I’m not going to go into any design for the how; rather, here is a high-level thought. Identify from your test plan a few critical test paths that cover both happy-path and negative test cases. Give priority to the test cases that cover the most code paths and components. For example, if your service has replication, SQL transactions, a flush policy, etc., encapsulate all of these in a single test case and try to automate the complex path. This will help ensure that the whole pipeline in your architecture is working as expected. There are no right or wrong tools for this. From batch files and shell scripts to C# and Ruby on Rails, it’s up to you to find the tool set and language appropriate for the task.
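Here is one way the "single test case for the complex path" idea could be sketched in Python. This is not a design, just an illustration under my own assumptions: `run_composite` and the step functions are hypothetical, and two in-memory dictionaries stand in for a real primary store and its replica. The cleanup step is also my addition, so that the probe record does not linger.

```python
from typing import Callable, Dict, List, Tuple

# A step is a name plus a function that raises an exception on failure.
Step = Tuple[str, Callable[[Dict], None]]


def run_composite(steps: List[Step]) -> Tuple[str, str]:
    """Run steps in order with a shared context; stop at the first failure.

    Returns ("Pass", "") or ("Fail", "<step name>: <reason>").
    """
    context: Dict = {}
    for name, step in steps:
        try:
            step(context)
        except Exception as exc:
            return "Fail", f"{name}: {exc}"
    return "Pass", ""


# In-memory stand-ins for a primary store and its replica (assumptions).
primary: Dict[str, str] = {}
replica: Dict[str, str] = {}


def write_transaction(ctx: Dict) -> None:
    # Write a probe record, as a customer transaction would.
    primary["probe-key"] = "probe-value"
    ctx["key"] = "probe-key"


def replicate(ctx: Dict) -> None:
    # Stand-in for the service's own replication.
    replica.update(primary)


def verify_replica(ctx: Dict) -> None:
    if replica.get(ctx["key"]) != primary[ctx["key"]]:
        raise AssertionError("replica is missing the probe record")


def cleanup(ctx: Dict) -> None:
    # Remove the probe record so it does not linger in the stores.
    primary.pop(ctx["key"], None)
    replica.pop(ctx["key"], None)
```

Running `run_composite([("write", write_transaction), ("replicate", replicate), ("verify", verify_replica), ("cleanup", cleanup)])` exercises the whole path in one test case, and a failure names the step that broke, which helps when the result is escalated.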

*role – An installation or instance of the operating system serving a specific capability. For example, an authentication system could be one instance of the OS in your deployment whose sole function is to authenticate all traffic that accesses your service.
