Instagram for Mongo

When Instagram finally released an app for Windows Phone, a year after its release on iPhone and Android, it was missing perhaps its most important feature — the ability to take a picture within the app. Instagram explained they wanted to get the app to users as quickly as possible, and although a few features were missing, they assured everyone these features would come in a future release. Welcome to the new paradigm of shipping software to the cloud.

When I started in Test, we shipped our products on DVD. If a feature was missing or broken, the customer either had to wait for a new DVD (which might take a year or three) or download and install a service pack. Today, our software is delivered as a service (SaaS). We ship to the cloud, and all customers are upgraded at once. For the Instagram app, the software is upgraded in the cloud, customers are automatically notified, and they can upgrade their phone with a single click.

Mongo likes Instagram.

In both cases, updating software is easier than ever. This allows companies to get their products to market quicker than ever. There’s no longer a need to develop an entire product before shipping. You can develop a small feature set, possibly with some known (but benign) bugs, and then iterate, adding new scenarios and fixing bugs.

In How Google Tests Software, James Whittaker explains that Google rarely ships a large set of features at once. Instead, they build the core of the product and release it the moment it’s useful to as many customers as possible. Whittaker calls this the “minimal useful product”. Then they iterate on this first version. Sometimes Google ships these early versions with a Beta tag, as they did with Android and Gmail, which kept its Beta tag for five years.

When I first read Whittaker’s book, I had a hard time accepting that you should release what’s essentially an unfinished product. But I hadn’t worked on a cloud service yet, either. Now that I’ve done so for a few years, I’m convinced this is the right approach. This process can benefit everyone affected by the product, from your customers, to your management, to your (wait for it) self.

Customers

  1. By shipping your core scenarios as soon as they’re ready, customers can take advantage of them without waiting for less important scenarios to be developed.

    I worked on a project where the highest priority was to expose a set of data in our UI. We completed this in about two weeks. We knew, however, some customers needed to edit this data, and not just see it. So we decided to ship only after the read/write functionality was complete. Competing priorities and unexpected complications made this project take longer than expected. As a result, customers didn’t get their highest-priority scenario, the ability to simply view the data, for two months rather than two weeks.

  2. When a product is large and complex, scenarios can go untested due to either oversight or test debt. With a smaller feature set, it’s less likely you’ll overlook scenarios. And with this approach, there shouldn’t be any test debt. Remember, Whittaker is not saying to release a buggy product with a lot of features, and then iterate fixing the bugs. He’s saying to ship a high-quality product with a small feature set, and iterate by adding features.
  3. Many SaaS bugs are due to deployment and scale issues, rather than functional issues. Using an iterative approach, you’ll find and address these bugs quickly because you’re releasing features quickly. By the time you’re developing the last feature, you’ll hopefully have addressed all these issues.
  4. Similarly, you’ll be able to incorporate customer feedback into the product before the last feature has even been developed.
  5. Customers like to get updates! On my phone, I have some apps that are updated every few weeks, and others that are updated twice a year. Sure, it’s possible the apps modified twice a year are actually getting bigger updates than the ones updated often, but it sure doesn’t feel that way.

Management

  1. This approach gets new features to market quicker, giving your company a competitive advantage over, well, your competitors.
  2. By releasing smaller, high-quality features, fewer bugs will be found by customers. And bugs reported by customers tend to be highly visible to, and frowned upon by, management.

You

  1. If you work on a cloud product, there’s undoubtedly an on-call team supporting it. You’re probably part of it. By releasing smaller feature sets with fewer bugs, the on-call team will receive fewer customer escalations, and be woken up fewer times at night. You’re welcome.

The Evolution of Software Testing

As I sat down to eat my free Microsoft oatmeal this morning, I noticed how dusty my Ship-It plaque had become. I hadn’t given it much attention lately, but today it struck me how these awards illustrate the evolution of software testing.

Since I started at Microsoft seven years ago, I’ve earned nine Ship-It awards — plaques given to recognize your contribution to releasing a product. The first Ship-It I “earned” was in 2006 for Microsoft Antigen, an antivirus solution for Exchange.

I put “earned” in quotes because I only worked on Antigen a couple of days before it was released. A few days into my new job, I found myself on a gambling cruise — with open bar! — to celebrate Antigen’s release. I was also asked to sign a comically over-sized version of the product DVD box. Three years later I received another Ship-It — and signed another over-sized box — for Antigen’s successor, “Forefront Protection 2010 for Exchange Server”.

Fast-forward to 2011, when I received my plaque for Microsoft Office 365. This time there was no DVD box to sign — because the product wasn’t released on DVD. It’s a cloud-based productivity suite featuring email and Office.

This got me thinking. In 2006, when we shipped Antigen, all the features we wanted to include had to be fully developed, and mostly bug-free, by the day the DVD was cut. After all, it would be another three years before we cut a new one. And it would be a terrible experience for a customer to install our DVD, only to then have to download and install a service pack to address an issue.

By 2011, however, “shipping” meant something much different. There was no DVD to cut. The product was “released” to the cloud. When we want to update Office 365, we patch it in the cloud ourselves without troubling the customer; the change simply “lights up” for them.

This means products no longer have to include all features on day one. If a low-priority feature isn’t quite ready, we can weigh the impact of delaying the release to include the feature, versus shipping immediately and patching it later. The same holds true if and when bugs are found.

What does this all mean for Test? In the past, it was imperative to meet a strict set of release criteria before cutting a DVD. For example, no more code churn, test-pass rates above 95%, code coverage above 65%, etc. Now that we can patch bugs quicker than ever, do these standards still hold? Have our jobs become easier?

We should be so lucky.

In fact, our jobs have become harder. You still don’t want to release a buggy product — customers would flock to your competitors, regardless of how quickly you fix the bugs. And you certainly don’t want to ship a product with security issues that compromise your customers’ data.

Furthermore, it turns out delivering to the cloud isn’t “free.” It’s hard work! Patching a bug might mean updating thousands of servers and keeping product versions in sync across all of them.

Testers now have a new set of problems. How often should you release? Which bugs need to be fixed before shipping, and which can be fixed afterward? How do you know everything will work when it’s deployed to the cloud? If the deployment fails, how do you roll it back? If the deployment works, how do you know it continues to work — that servers aren’t crashing, slowing down, or being hacked? Who gets notified when a feature stops working in the middle of the night? (Spoiler alert: you do.)

I hope to explore some of these issues in upcoming articles. Unfortunately, I’m not sure when I’ll have time to do that. I’m on-call 24×7 for two weeks in December.

Test In Production – What is it all about?

I would like to share my thoughts about test in production (a.k.a. TiP). This term has become a buzzword in the testing world as the industry moves toward providing solutions in the cloud. I plan to address it through four simple questions: what, why, when, and how.

I like to explain this with an analogy between the boxed products of the olden days and the cloud services of today. When software shipped as a boxed product or a downloadable executable, testing was, in a way, much simpler. Boxed products had well-defined system requirements: operating system (type and version), supported locales, disk space, RAM, yada yada yada. So when testers defined the test plan, it was self-contained within the boundaries defined by the product. When end users bought the boxed product, it was their decision which hardware to install the software on. It was a balanced equation, I guess: what was tested matched what was installed, and as long as end users chose hardware meeting the system requirements, everything worked as expected.

With the evolution of today’s cloud-oriented solutions, customers want solutions that optimize cost (which, in my opinion, is one of the reasons the cloud is growing). The company providing the software service decides on the hardware to suit its scale and performance needs. In reality, not all software is custom-made for its hardware, so there are many hardware-related variables when it comes to testing software services in the cloud. For example, when you host a solution used by hundreds of thousands of users, you can expect tens or even hundreds of servers in the data center.

Software once tested on one machine, or a few machines depending on the architecture, now becomes a huge network tied to various Service Level Agreements (SLAs) covering performance, latency, scale, fault tolerance, security, blah blah blah. Although it is very much possible to simulate a data-center-like setup within your corporate environment, there may (and will) be plenty of differences from the actual setup in the data center. Some of these may include, but are not limited to, load balancers, Active Directory credentials, different security policies applied on the hosts, domain controller configurations specific to your hosting setup, and storage access credentials; and these are just the tip of the iceberg.

What

So what is TiP? My definition: TiP is an end-to-end customer test scenario that you author with the required input parameters and run continuously, at a predefined interval, against the endpoints of the hosted service. It validates functionality and component integration, and it provides a binary result: Pass or Fail. There are at least two types of TiP test you can author: Outside-In (OI) and Inside-Out (IO).

Outside-In (OI): These tests run from outside your production environment and target your service’s public endpoints.

Inside-Out (IO): These tests run from within your data center and target the various roles you may have, to ensure they are all functioning properly.
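To make the Outside-In flavor concrete, here is a minimal sketch of such a probe in C#. The endpoint URL and the five-minute interval are assumptions for illustration; a real TiP test would drive a full customer scenario and feed its Pass/Fail result into the service’s monitoring pipeline rather than the console.

```csharp
// Minimal sketch of an Outside-In TiP probe (endpoint and interval are hypothetical).
// It calls a public endpoint of the hosted service on a fixed schedule and
// records a binary Pass/Fail result for each run.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class OutsideInProbe
{
    static async Task Main()
    {
        var endpoint = new Uri("https://example-service.cloudapp.net/api/health"); // hypothetical
        var interval = TimeSpan.FromMinutes(5);
        using var client = new HttpClient { Timeout = TimeSpan.FromSeconds(30) };

        while (true)
        {
            bool passed;
            try
            {
                // The "scenario" here is just "the endpoint answers successfully";
                // a real TiP test would exercise a full customer scenario.
                var response = await client.GetAsync(endpoint);
                passed = response.IsSuccessStatusCode;
            }
            catch (HttpRequestException) { passed = false; }  // connection-level failure
            catch (TaskCanceledException) { passed = false; } // timeout

            // In a real service this result would flow into monitoring and alerting.
            Console.WriteLine($"{DateTime.UtcNow:o} Outside-In probe: {(passed ? "Pass" : "Fail")}");

            await Task.Delay(interval);
        }
    }
}
```

An Inside-Out test would look much the same, except it would run from machines inside the data center and target internal role endpoints instead of the public ones.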

Why

TiP enables you to find issues proactively, before you hear about them from a customer. Since the tests run against your live site, appropriate monitoring should be built into the architecture so that failures from these critical tests are escalated and acted upon. TiP is a valuable asset for validating your deployment and the plumbing between the different software roles* in your architecture. It plays a critical role during a service deployment or upgrade, because it runs end-to-end tests on the production systems before they go live to take real-world traffic. Automated TiP scenario tests can also save testers a lot of manual validation of functionality in the production system.

When

TiP is recommended to run all the time, for as long as you keep your software service alive.

How

I’m not going to go into design details here; this is a high-level thought. Identify from your test plan a few critical test paths that cover both happy-path and negative cases. Give priority to the test cases that cover the most code paths and components. For example, if your service involves replication, SQL transactions, flush policies, and so on, encapsulate all of them into a single test case and automate that complex path. This helps ensure that the whole pipeline in your architecture is behaving as expected. There are no right or wrong tools for this. From batch files and shell scripts to C# and Ruby on Rails, it’s up to you to find the tool set and language appropriate for the task.
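As one hypothetical illustration of folding several components into a single Pass/Fail, the sketch below writes a record through a service’s public API and then polls a read endpoint until the record appears, which only succeeds if the write, the underlying SQL transaction, and replication all worked. The endpoint paths, the payload, and the timings are assumptions, not any particular product’s API.

```csharp
// Hypothetical sketch: one end-to-end TiP test case that covers several
// components at once (write -> SQL transaction -> replication -> read back).
// 'client' is expected to have BaseAddress set to the service's public URL.
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

static class PipelineTest
{
    public static async Task<bool> RunAsync(HttpClient client)
    {
        string id = Guid.NewGuid().ToString("N");
        var payload = new StringContent($"{{ \"id\": \"{id}\", \"room\": \"101\" }}",
                                        Encoding.UTF8, "application/json");

        // Step 1: write through the public endpoint (hypothetical path).
        var write = await client.PostAsync("/api/reservations", payload);
        if (!write.IsSuccessStatusCode) return false;

        // Step 2: poll a read endpoint until the write has replicated,
        // or give up after roughly two minutes.
        for (int attempt = 0; attempt < 24; attempt++)
        {
            var read = await client.GetAsync($"/api/reservations/{id}");
            if (read.IsSuccessStatusCode) return true;   // Pass: the whole pipeline worked
            await Task.Delay(TimeSpan.FromSeconds(5));
        }
        return false;                                    // Fail: something in the chain broke
    }
}
```

A test like this could run from the same interval loop shown in the Outside-In sketch above, escalating to the alerting system whenever it returns false.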

*role – an installation or instance of the operating system serving a specific capability. For example, an authentication system could be one OS instance in your deployment whose only function is to authenticate all the traffic accessing your service.

Who Tests the Watchers?

Back in February, in his post titled “Monitoring your service platform – What and how much to monitor and alert?”, Prakash discussed the monitoring of a service running in the cloud, the multitude of alerts that could be set up, and how testers need to trigger and test the alerts. Toward the end he said:

Once we can simulate these failures successfully, the test results will give the operations engineering team confidence that they can handle and manage the alerts quickly and effectively.

When I read this part of his post, one thing jumped into my Tester’s mind: who is verifying that there is a working process in place to deal with the alerts once they start coming through from production?  If you have an Operations team, do they already have a system in place that you can just ‘plug’ your alerts into?  If there isn’t a system in place, are they responsible for developing one?

If you, the tester, have any doubts about the alert-handling process, here are a few questions you might want to ask.  Think of it as a test plan for the process.

  • Is the process documented someplace accessible to everyone it affects?
  • Is there a CRM (Customer Relationship Management) or other tracking system for the alerts?
  • Is a system like SCOM  (System Center Operations Manager) being used that can automatically generate a ticket for the alert, and handle some of the alerts without human intervention?
  • Is there an SLA (Service Level Agreement) or OLA (Operations Level Agreement) detailing turn-around time for each level of alert?
  • Who is on the hook for the first level of investigation?
  • What is the escalation path if the Operations team can’t debug the issue?  Is it the original product team?  Is there a separate sustainment team?
  • Who codes, tests, and deploys any code fix?

You might want to take an alert from the test system and inject it into the proposed production system, appropriately marked as a test, of course.  Does the system work as expected?
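One way to run that experiment, sketched below under stated assumptions, is to hand-craft an alert payload that is explicitly flagged as a test, post it to whatever intake endpoint the production alerting pipeline exposes, and then watch whether a ticket is created and routed within the agreed SLA or OLA. The intake URL and the payload fields are invented for illustration; if you use SCOM or another off-the-shelf system, use its own test-injection mechanism instead.

```csharp
// Hypothetical sketch: inject a synthetic, clearly marked test alert into the
// production alert pipeline. The intake URL and payload schema are assumptions,
// not any real product's API.
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class TestAlertInjector
{
    static async Task Main()
    {
        using var client = new HttpClient();

        string payload = @"{
            ""source"":   ""WebServer01"",
            ""severity"": ""Warning"",
            ""message"":  ""TEST ALERT - synthetic disk-space warning"",
            ""isTest"":   true
        }";

        // Hypothetical intake endpoint for the alerting/ticketing system.
        var response = await client.PostAsync(
            "https://alerts.example.com/api/intake",
            new StringContent(payload, Encoding.UTF8, "application/json"));

        Console.WriteLine(response.IsSuccessStatusCode
            ? "Test alert accepted; now verify a ticket was created and routed on time."
            : $"Injection failed with status {(int)response.StatusCode}.");
    }
}
```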

More information on monitoring and alerts can be found in the Microsoft Operations Framework.

Monitoring your service platform – What and how much to monitor and alert?

This is the first in a series of articles related to monitoring of cloud services. I’ve been working as a Software Test Lead on a cloud-based product for a few years, and I’d like to share my experiences. In this first article I’ll give a high-level overview of monitoring and alerts.

SaaS, Software as a Service, is the direction the growing IT industry is moving in. It provides quick, cheap, and easy management of, and access to, stored data. More and more corporations are making the shift to move their products and solutions to the cloud. What qualities do customers look for when they want their solution in the cloud? Security, reliability, scalability, cost, performance, reporting, yada yada yada. Assuming all these requirements are met — and met very well with an SLA of golden nines (99.999%) — customers are happy. As long as all of the features promised to the customer work, everyone is a happy camper. But that’s not the reality we testers live in.

Running software as a service is not an easy task. The bigger the architecture of the system you build for your cloud, the more complex it becomes to test and validate. One of the ways we minimize risk in software as a service, or in the services platform, is to add appropriate alerts that monitor for symptoms requiring action to rectify. So is monitoring the system or platform the ultimate solution for keeping your site reliable? I think it’s a combination of monitoring plus the quality of the software.

Let’s consider an example hotel reservation system. To be aware of how the system is running in production, I may have many alerts that monitor my web server, disk usage, latency, order processing system, email notification system, etc. These are just a few of the logical components I can imagine, and each component within the system may have multiple alerts. For example, to monitor the health of the web server, here is a sample list of alerts you may choose to have:

Web server Monitoring:

  1. Alert me if CPU utilization is high
  2. Alert me if a hard disk has failed
  3. Alert me if a certificate has expired
  4. Alert me if a network resource is not reachable
  5. Alert me if replication is broken
  6. Alert me if latency is greater than the threshold
  7. Alert me if a hard disk is running out of space
  8. Alert me if unauthorized admin access activity happens
  9. Alert me if a software patch on the server failed
  10. Alert me if my primary server failed over
  11. Alert me if … there could be many more

It’s great to see how many alerts can be defined just for this one component (the web server) in my example. Now expand the same concept across the complete spectrum of your service design and architecture, and you can imagine the number of monitoring points and alerts you need to build into the system, along with the potential burst of alerts the system may generate in a real-time production environment. In general, alerts can be classified into three categories: Informational, Warning, and Critical. Depending on the requirement, each alert may result in an action, or a set of actions, to mitigate risks to the service.

Given the complexity of the hotel reservation system, if you have only a handful of customers, managing the alerts at that scale may be easy. If the scale grows and you have thousands of customers, the complexity increases. An alert from the system could be specific to one logical group of customers, to the entire customer base, or to the entire service. Managing these alerts and resolving them in a timely fashion becomes one of the critical factors for customer satisfaction, and it can increase the COGS (Cost of Goods Sold) for the service.
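To illustrate that three-way classification, here is a small sketch of how a single disk-space reading might be bucketed. The severity names follow the categories above; the thresholds and the helper itself are assumptions for illustration, not a standard.

```csharp
// Hypothetical sketch: map a disk-space reading onto the three alert
// categories discussed above. The thresholds are illustrative only.
enum AlertSeverity { Informational, Warning, Critical }

static class DiskSpaceRule
{
    // percentFree is the percentage of free space remaining on the disk.
    public static AlertSeverity Classify(double percentFree)
    {
        if (percentFree < 5)  return AlertSeverity.Critical;      // act immediately
        if (percentFree < 20) return AlertSeverity.Warning;       // plan remediation soon
        return AlertSeverity.Informational;                       // record it, no action needed
    }
}
```

The same pattern applies to any alert in the list above: each rule should map a symptom to a category, and each category should map to a known action.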

Now, as testers, it’s our responsibility to validate all of those alerts and make sure each one works the way it’s supposed to. Wow. It sounded simple when I wrote that sentence; in reality it’s, uh, a tall order. A team trying to achieve the magic-nines SLA will attempt to put alerts here, there, and everywhere to catch any possible issue. When this large number of alerts is combined with the scale your service architecture is designed for, it can turn into a nightmare to identify legitimate alerts, manage them, and resolve the issues. And over time, whenever something goes wrong with the system and results in a service outage, we tend to add more and more alerts. The bottom line is that the more alerts you put into the system, the more likely they are to create noise, and over time that noise becomes overwhelming and gets ignored. Ignored alerts may hide a legitimate alert, resulting in disaster. As testers, we should review all the alerts, take the time to categorize them into appropriate buckets, and ensure each alert leads to an action that rectifies the problem. Testing these buckets of alerts in the lab is a challenging task, as is simulating the failure points in the system to trigger the alerts. Once we can simulate these failures successfully, the test results will give the operations engineering team confidence that they can handle and manage the alerts quickly and effectively.

Alerts are really important to the service, so be thoughtful about which ones you add. Weigh the number of alerts you want your system to trigger and keep it balanced so that the quality of the service is maintained. Review all the alerts carefully and ensure that each results in an actionable item to fix the system. Keep alerts as alerts, and don’t let them become noise in your system. Control the COGS for your service effectively and make monitoring and alerting efficient.

Happy monitoring!
