Calculating mean time to failure in performance testing

Calculating MTTF (mean time to failure) can be difficult for testers developing a performance test pass, as there are multiple steps. This expert tip walks you through the process.

John Overbaugh

A challenge web application performance testers face is developing metrics for a successful performance test pass. In this tip, I focus on developing the metrics for a mean time to failure (MTTF) test pass.

Mean time to failure is the duration (in time or transactions) after which the system under test is likely to fail. Obviously, the higher the MTTF, the better the application. When devising MTTF metrics or requirements, I calculate my measurements to a lowest-common-denominator. In most cases, this is the transaction.

I define a web application transaction as one request to or one response from the server, regardless of whether the request/response is a user-initiated URL request or an Ajax/RIA request initiated behind the scenes. That's the beauty of the transaction as a measurement: it's source-agnostic! Not only is the transaction independent of the underlying implementation, it's also easily monitored, logged, and measured in sequential order in the web server log.

The first step in developing MTTF metrics is to identify a suite of user scenarios. For instance, let's assume we're testing a simple web site which allows users to log in, view their graffiti wall, upload comments to their wall and read comments on their wall from other users. A common user scenario, therefore, might include:

  1. Login request/response: this action is a combination of several requests and responses, including session creation/cookie management, secure transactions where the username and password are sent to the server, etc. For the sake of this exercise, let's assume this is 25 requests, with 20 corresponding responses.
  2. Graffiti wall redirect: once the user is successfully logged in, a series of request/response transactions occur in which the user's browser is redirected to the graffiti wall page. At this point, numerous graphics and other page elements are requested and returned to the user. Let's assume this activity results in 30 requests and 22 responses.
  3. Next, the user clicks to view three other graffiti walls of her friends. Each wall is an average of 38 requests and 38 responses, repeated three times.

It's a gross over-simplification to assume this one single scenario represents our mock site's complete user traffic, but again the purpose is only to illustrate generating metrics. Given the information above, we can assume our site experiences 25+20+30+22+((38+38)*3) or 325 transactions for each user login. Congratulations – we've generated our first (and only) user scenario!
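The per-scenario arithmetic above can be sketched in a few lines of Python. The request and response counts are the illustrative figures from the example, not measurements from a real site:

```python
# Per-scenario transaction count for the mock graffiti-wall site.
# All figures are the illustrative assumptions from the article.
login = 25 + 20                # login requests + responses
wall_redirect = 30 + 22        # redirect to the user's own wall, plus page elements
friend_walls = (38 + 38) * 3   # three friends' walls, 38 requests/responses each

transactions_per_login = login + wall_redirect + friend_walls
print(transactions_per_login)  # 325
```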

The next step in arriving at an MTTF metric is to estimate the user scenarios per day. Let's assume each user, thrilled by the frequent contact with his or her friends, logs into the site five times daily: once in the morning, once at lunch time, once at dinner and twice in the evening between 8 and 10.

Note that, up to this point, the data we have gathered applies to both MTTF and response/load testing. From here forward, we'll focus only on the MTTF testing.

With the total number of logins per day, we know that the total transactions per user per day is 325*5 = 1625. Let's be generous and assume we have 1,000 users, of whom 50% are active on a daily basis. This means 1625*(1000*.5) = 812,500 transactions per day (while this seems like a huge number, spread across an 18-hour day it works out to roughly 45,139 transactions per hour).
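Carrying the per-login figure forward, a short sketch of the daily-volume calculation, using the article's assumptions (five logins per user per day, 1,000 users, 50% daily active, an 18-hour day):

```python
# Daily transaction volume under the article's illustrative assumptions.
transactions_per_login = 325
logins_per_day = 5
users = 1000
active_fraction = 0.5
hours_per_day = 18

per_user_daily = transactions_per_login * logins_per_day    # 1,625
daily_total = int(per_user_daily * users * active_fraction) # 812,500
hourly_rate = daily_total / hours_per_day                   # ~45,139

print(daily_total, round(hourly_rate))
```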

Calculating our MTTF requirement is rather easy at this point: we work with our operations team to identify their goal, which is an uptime of 30 days (their service window calls for a server restart once every thirty days). Therefore, 812,500 * 30 = 24,375,000 transactions total. To prove our system can remain responsive for the required window, we need to show it can process about 24 million successive transactions before crashing.

Once the MTTF goal has been identified, the tester needs to automate the aforementioned user scenario and play it back, monitoring the total transaction count. My experience shows that, in a robust site, I can speed up the automated playback to the point where the first performance bottleneck reaches about 90% of capacity; a less robust site generally limits me to around 75% of capacity. Let's assume our site is robust, and that we reach 90% CPU at a sustained rate of 400,000 transactions per hour. 24,375,000 / 400,000 rounds up to 61, or a total of 61 consecutive hours at the sustained transaction rate. This is the amount of time our automated test needs to run, at 90% of capacity, in order to prove our 30-day uptime.
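Putting the last two steps together, a minimal sketch of the run-time estimate, assuming the article's figures (812,500 transactions per day, a 30-day uptime goal, and a sustained accelerated playback rate of 400,000 transactions per hour):

```python
# End-to-end MTTF run-time estimate: the total transaction target for
# the 30-day uptime goal, and the hours of accelerated playback needed
# at the assumed sustained rate. All inputs are the article's examples.
import math

daily_transactions = 812_500
uptime_days = 30
sustained_rate = 400_000   # transactions/hour at ~90% CPU

mttf_target = daily_transactions * uptime_days           # 24,375,000
run_hours = math.ceil(mttf_target / sustained_rate)      # 61

print(mttf_target, run_hours)
```

Rounding up with `math.ceil` errs on the safe side: running slightly longer than the target can only strengthen the uptime claim.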

Again, this is a gross over-simplification of our site. Its goal is to illustrate how a test team can calculate the MTTF both in terms of total transactions as well as total run time. In another tip, I'll use the same test scenario to show how a test team can calculate load and response time thresholds based on the same input.
