

4 tips to tackle common microservices testing problems

Flaky tests put code quality -- and development team morale -- at risk. Proliferating microservices stress out test environments, so use these techniques to keep everything running.

The initial promise of microservices architecture was that it could drastically reduce regression testing. By composing applications from services that each have a defined service contract -- verified by what is sometimes called a contract test suite -- you could hypothetically deploy services separately. As long as a service continued to pass its contract tests, it could be switched in and out with ease. Unfortunately, it doesn't play out that way.

In the pursuit of continuous delivery, software teams want to deploy features as often as possible, and they also want to thoroughly test the GUIs before deploying to production. To move quickly, they implement tooling that automates end-to-end testing. This way, the team can simply deploy a change to the system integration test (SIT) environment, run the tests and wait for the results.

But say, for example, that results show failures for four out of the two dozen sub-tests, and each of the failures seems to be random. It's even possible that on a rerun two hours later, a different set of three or four sub-tests fails. At that point, the lead testers often mumble something about the flaky nature of the sub-tests and rerun the failed tests by hand, turning what was supposed to be an automated process into a manual one.

What is the problem?

The problem is almost certainly not the tests. At least, it is not the way the tests are written.

It is much more likely to be the SIT environment. Perhaps, as the tests were running, someone rolled out a new login service or GUI element. Perhaps, the service was only down for seven minutes, right when the end-to-end tests were calling the service. Or it could have been a change in the database layer, which spits out slightly different -- but still correct -- results due to a service call. In any event, by the time a human is retesting the system, everything is back up in working order.

The first problem is that the rollouts take too long, and the odds are far too high that a test will exercise a service that's in the middle of a rollout. As teams gain a better understanding of end-to-end tests and perform more of them, those odds increase. For instance, a login service may be down because the login team is testing a change. This situation might seem manageable when a company has two teams, but when 20 or more teams work on one popular service, it can make progress impossible for half the organization.

And, of course, GUI or functionality changes might cause a test to report those changes as a failure. Once a test runs repeatedly, it essentially acts as change detection. But imagine running continuous change detection in an environment with 60 teams, all making multiple deploys to the SIT environment daily. Each change looks like a test failure to teams that didn't know about the deployment. Mature companies with interdependent services are even more likely to have this problem, because if any service in the chain is down, the tests that depend on it will fail.

The answer to these problems is to strengthen the SIT environment. You can do this by reducing the number of unique but similar microservices that developers create, investing in better update deployments, adopting tracing technology to pinpoint failures and getting familiar with mock code.

Focus on service glossaries and versioning

The first remedial step for microservices testing issues is to make a service glossary and focus on reuse. With healthy collaboration, an organization can develop a wide variety of API types. But without communication, software teams risk creating redundant and slow services that introduce complex dependency chains. To combat this, incorporate a service glossary for the entire organization and increase efforts to enable API reuse and run fewer deployments.

If your teams already coordinate releases with integration as a focus, the next step is to version services and make them backward compatible. The other teams can then ask for a specific version if needed.
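As a rough illustration of versioned, backward-compatible services, the sketch below keeps both versions of a service available and lets a caller ask for a specific one. Everything here is hypothetical: the `ServiceRegistry` class, the `login_v1` and `login_v2` handlers and the `session_ttl` field are assumptions made for the example, not part of any particular framework.

```python
from typing import Callable, Dict, Optional, Tuple

Handler = Callable[[dict], dict]


class ServiceRegistry:
    """Hypothetical in-memory registry keyed by (service name, version)."""

    def __init__(self) -> None:
        self._handlers: Dict[Tuple[str, int], Handler] = {}

    def register(self, name: str, version: int, handler: Handler) -> None:
        self._handlers[(name, version)] = handler

    def resolve(self, name: str, version: Optional[int] = None) -> Handler:
        """Return the requested version, or the newest one if none is given."""
        versions = sorted(v for (n, v) in self._handlers if n == name)
        if not versions:
            raise LookupError(f"no such service: {name}")
        chosen = versions[-1] if version is None else version
        if chosen not in versions:
            raise LookupError(f"{name} has no version {chosen}")
        return self._handlers[(name, chosen)]


def login_v1(request: dict) -> dict:
    return {"user": request["user"], "ok": True}


def login_v2(request: dict) -> dict:
    # Backward compatible: every v1 field is kept, one new field is added.
    return {"user": request["user"], "ok": True, "session_ttl": 3600}


registry = ServiceRegistry()
registry.register("login", 1, login_v1)
registry.register("login", 2, login_v2)

# A team that has not migrated yet can pin the old version explicitly.
pinned = registry.resolve("login", version=1)
latest = registry.resolve("login")
```

Because the newer version only adds a field, a test written against version 1 keeps passing whether it pins the version or takes the latest.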

Pursue faster rollouts

Service-level agreements that give teams long timeframes to get new code in production, such as four hours, can cause problems with test conflicts and downtime. In these long rollouts, an operations team receives a request for a specific build to launch on live servers, and it can take roughly 30 minutes to actually turn off the web server, swap the code and reboot the web server. During that time, the software dispatches a message that the system is down for maintenance.

Modern systems are designed differently. Instead of one massive shift from an old version to the new one, the system updates one service at a time. To make tests more tolerant of rollouts this fast, add failover code that instructs them to wait slightly longer than the typical update takes, then try to run again.
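The wait-then-retry idea might look something like the following minimal sketch. The `ServiceUnavailable` exception, the `call_with_retry` helper and the five-second margin are assumptions chosen for illustration; a real test suite would catch whatever error its HTTP client or test framework raises during a rollout.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ServiceUnavailable(Exception):
    """Hypothetical error raised when a service is in the middle of a rollout."""


def call_with_retry(
    call: Callable[[], T],
    typical_rollout_seconds: float,
    margin_seconds: float = 5.0,
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a call, waiting slightly longer than a typical rollout each time."""
    wait = typical_rollout_seconds + margin_seconds
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ServiceUnavailable:
            if attempt == max_attempts:
                raise  # Still failing after all retries: report a real failure.
            sleep(wait)
    raise ValueError("max_attempts must be at least 1")
```

Only the unavailability error is retried. An assertion failure or any other exception surfaces immediately, so the retry does not hide genuine defects.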

This combination of a fast response and quick retries can reduce the number of false failures in SIT, but only if the code works right the first time.

Use tracer bullets

Some view SIT failures as a problem not with the code, but with what testers can observe during the process; it's often unclear exactly what went wrong and when. One remedial practice is to build tracer bullets -- a term coined in The Pragmatic Programmer: From Journeyman to Master -- into every service call.

Tracer bullets have a unique identifier, and you can use them to search service call records in the same way you use a stack trace when debugging traditional code. The results that come back show the services, the order of calls, how long each call took and a status report. With that information, testers can find slow and flaky services with just a few queries.
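To make the idea concrete, here is a small sketch of how a unique identifier could tie service call records together. The `CallRecord` type, the in-memory `CALL_LOG` list and the `traced_call` and `find_trace` helpers are all hypothetical; a production system would more likely pass the identifier in a request header and store records in a log or tracing backend.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, List, TypeVar

T = TypeVar("T")


@dataclass
class CallRecord:
    trace_id: str
    service: str
    duration_ms: float
    status: str


# Hypothetical stand-in for a searchable store of service call records.
CALL_LOG: List[CallRecord] = []


def new_trace_id() -> str:
    """Create the unique identifier that travels with one end-to-end test."""
    return uuid.uuid4().hex


def traced_call(trace_id: str, service: str, call: Callable[[], T]) -> T:
    """Run a service call and record its duration and status under trace_id."""
    start = time.perf_counter()
    status = "ok"
    try:
        return call()
    except Exception:
        status = "error"
        raise
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        CALL_LOG.append(CallRecord(trace_id, service, elapsed_ms, status))


def find_trace(trace_id: str) -> List[CallRecord]:
    """Return every record for one trace, in the order the calls completed."""
    return [record for record in CALL_LOG if record.trace_id == trace_id]
```

Filtering `find_trace` results by `status` or sorting them by `duration_ms` is the kind of query that would surface a flaky or slow service.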

Create mocks and stubs

The ultimate goal is to get code to work the first time in the SIT. While the broad discussion of how to improve code quality is extensive, consider elements like code-as-craft, unit tests and integration tests before deploying to the SIT environment. Many teams believe they can't do this, because SIT is where everything comes together.

The answer to that problem might be service virtualization. Service virtualization involves running end-to-end tests on all the services and recording a copy of those test processes and results. Put the resulting pairs of I/O payloads into a service, and you'll end up with a mock service to run tests against. This process can be a little tricky when dealing with variables such as dates and order numbers; however, with a little programming, it's possible to create a development environment that looks and behaves like an integration environment, especially for humans or an automated tool to test against before code moves to the SIT environment.
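The record-and-replay approach described above can be sketched in a few lines. This is an illustration under stated assumptions rather than how any service virtualization product works: `RecordingProxy`, `MockService`, the `price_service` function and the choice to ignore a `timestamp` field are all invented for the example.

```python
import json
from typing import Callable, Dict, Iterable

Service = Callable[[dict], dict]

# Fields assumed to vary between runs, such as dates, are left out of the key.
VOLATILE_FIELDS = ("timestamp",)


def request_key(request: dict, ignore: Iterable[str] = VOLATILE_FIELDS) -> str:
    """Build a stable lookup key from the parts of a request that don't vary."""
    stable = {k: v for k, v in request.items() if k not in ignore}
    return json.dumps(stable, sort_keys=True)


class RecordingProxy:
    """Wraps a real service and saves each request/response pair."""

    def __init__(self, real_service: Service) -> None:
        self._real = real_service
        self.recordings: Dict[str, dict] = {}

    def __call__(self, request: dict) -> dict:
        response = self._real(request)
        self.recordings[request_key(request)] = response
        return response


class MockService:
    """Replays recorded responses without calling the real service."""

    def __init__(self, recordings: Dict[str, dict]) -> None:
        self._recordings = dict(recordings)

    def __call__(self, request: dict) -> dict:
        key = request_key(request)
        if key not in self._recordings:
            raise LookupError(f"no recording for request: {key}")
        return self._recordings[key]


def price_service(request: dict) -> dict:
    # Hypothetical real service used only to produce a recording.
    return {"sku": request["sku"], "price_cents": 1999}


recorder = RecordingProxy(price_service)
recorder({"sku": "A-1", "timestamp": "2020-01-01T00:00:00Z"})
mock = MockService(recorder.recordings)
```

Dropping volatile fields from the lookup key is one simple way to handle dates. Values such as order numbers that appear in responses would need additional substitution logic.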

Use the mock services in the SIT environment to run tests against a real service. However, if those services are down, use a stub, which is a dummy module that testers create using canned data when the actual modules are unavailable.
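Falling back to a stub when a service is down could look like the sketch below. The `profile_stub` function, its canned data and the `with_stub_fallback` wrapper are hypothetical, and the example assumes an unavailable service shows up as Python's built-in `ConnectionError`.

```python
from typing import Callable

# Canned data a tester might prepare by hand (hypothetical values).
CANNED_PROFILE = {"name": "Test User", "plan": "basic"}


def profile_stub(user_id: int) -> dict:
    """Dummy module that returns canned data instead of calling a service."""
    return {"user_id": user_id, **CANNED_PROFILE}


def with_stub_fallback(
    real: Callable[..., dict], stub: Callable[..., dict]
) -> Callable[..., dict]:
    """Call the real service, switching to the stub only if it is unreachable."""

    def call(*args, **kwargs) -> dict:
        try:
            return real(*args, **kwargs)
        except ConnectionError:
            return stub(*args, **kwargs)

    return call
```

A test would then call `with_stub_fallback(real_profile_service, profile_stub)` in place of the service itself, where `real_profile_service` stands for whatever client function the team already uses.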
