QA Engineering Performance Analysis: Don’t underestimate the power of the 99th percentile

If you work in software development or software quality assurance, one of the most important things to care about is the performance of your application. Simply put, you want your application to keep running smoothly as it takes on a realistic, growing load.

To measure that, you are probably using some of the many tools and methods available in the industry. It’s true that the tools will take your inputs and give you some results. But we are still left with the problem of analysing that data to get an idea of our system’s performance.

Let’s assume you decided to test your system’s performance using JMeter (if you are not familiar with it, find out more at https://jmeter.apache.org/). You’ll run some tests and get some stats about your system. If you are a newbie, you’ll probably look at the average response time, and you might feel confident after seeing the results. The problem is, although you have some positive evidence that the system runs smoothly, do you have the correct evidence?

Let’s take a small example (note: this is hypothetical data):

| Request Count | Response Time |
|---|---|
| 20 | 100ms |
| 18 | 110ms |
| 20 | 112ms |
| 20 | 115ms |
| 20 | 200ms |
| 1 | 10000ms |
| 1 | 5000ms |

According to the table above, the average response time is 275.2ms. But you noticed something, didn’t you? I guess you noticed the spike in the last two rows. Although the average response time looks great, that spike changes everything, because the average tells us only half of the story, and sometimes not even half.
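As a minimal sketch, the hypothetical table above can be expanded into one response time per request and summarised in a few lines of Python. The variable names are illustrative only, and the data is the same made-up data as in the table.

```python
from statistics import mean, median

# Hypothetical data from the table above: (request count, response time in ms)
buckets = [(20, 100), (18, 110), (20, 112), (20, 115), (20, 200), (1, 10000), (1, 5000)]

# Expand each row into one response time per request
samples = [ms for count, ms in buckets for _ in range(count)]

average = mean(samples)      # pulled up by the two slow requests
typical = median(samples)    # what the middle request actually saw
slowest = max(samples)       # the spike that the average hides

print(f"requests: {len(samples)}")
print(f"average:  {average:.1f} ms")
print(f"median:   {typical} ms")
print(f"slowest:  {slowest} ms")
```

The average lands well above what a typical request saw and far below the slowest one, so on its own it describes neither.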

DO NOT UNDERESTIMATE THE POWER OF A SPIKE

What could be the possible reason for that spike? As you know, the spike is caused by a delayed response to your request. There could be different reasons, such as network issues, caching issues, etc. Of all the possible reasons, the one that matters most to us here is latency in the downstream requests.

What are downstream requests?

To answer this question, let’s look at what happens when you send a request to https://www.google.lk .
This is what it takes for Google to display the UI that you know:
| Host | Company | Category | Total bytes | Average load | Number of requests |
|---|---|---|---|---|---|
| google.lk | Unknown | Unknown | 219 | 218 | 1 |
| www.google.lk | Unknown | Unknown | 267138 | 1976 | 10 |
| www.gstatic.com | Google | Advertising | 61186 | 388 | 2 |
| ssl.gstatic.com | Google | Advertising | 7325 | 343 | 1 |
| fonts.gstatic.com | Google | CDN Fonts | 30780 | 550 | 2 |
| apis.google.com | Google | Hosted Libraries | 49577 | 382 | 1 |
| ogs.google.com | Google | Hosted Libraries | 58719 | 399 | 1 |
| adservice.google.lk | Unknown | Unknown | 0 | 286 | 2 |
| adservice.google.com | Google | Hosted Libraries | 0 | 197 | 2 |

When you request the Google search page, another 24 requests in total are sent to various locations before the page gets back to you. Those requests are called downstream requests.

What will happen if a downstream request gets delayed?

This is where we need to focus. Most of the time, all these downstream requests and their response times matter for us to see the GUI that we work with. For that reason, a delay in a downstream request will affect the total response time of our system.
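To illustrate the point, here is a rough sketch under one simplifying assumption: the downstream requests are fired in parallel and the page is only complete when the slowest one returns. The latencies and the `page_response_time` helper are hypothetical, and real pages mix parallel and sequential requests.

```python
# Hypothetical latencies (ms) of the downstream requests behind one page view
downstream_ms = [120, 95, 110, 130, 105]

def page_response_time(latencies_ms):
    # Assumption: requests run in parallel, so the slowest one decides
    # when the page is complete
    return max(latencies_ms)

normal = page_response_time(downstream_ms)

# Same page view, but the last downstream request is delayed to 5000 ms
delayed = page_response_time(downstream_ms[:-1] + [5000])

print(f"all requests healthy: {normal} ms")
print(f"one request delayed:  {delayed} ms")
```

Under this model a single slow downstream request is enough to set the response time for the whole page, no matter how fast the others are.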

Percentile

A percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations falls. For example, the 20th percentile is the value below which 20% of the observations may be found.
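As a small sketch of that definition, the function below uses the nearest-rank convention, which is one of several common ways to compute a percentile (tools may interpolate or round differently). The function name and the sample response times are made up for illustration.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest observation with at least
    p% of all observations at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical response times in ms, one entry per request
response_times_ms = [120, 95, 110, 300, 105, 130, 98, 2500, 115, 102]

for p in (20, 50, 90, 99):
    print(f"p{p}: {percentile(response_times_ms, p)} ms")
```

With only ten observations the 99th percentile is simply the slowest request; with more samples it separates the rare extreme values from the merely slow ones.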

An important reason why we need percentiles is to measure how likely a user is to have a defined bad experience in terms of response time. To make it simpler, let’s think about a normal website that makes around 40 downstream requests.

In a test of 100 requests, we recorded the response times. When we rearranged the response times in ascending order, the chart looked like this:

[Chart: the 100 recorded response times in ascending order]


Now the million-dollar question is: how likely is a user to end up in that slowest tail? In other words, what is the probability that a user experiences a response time worse than the 99th percentile response time value (500ms)?

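Each downstream request has a 99% chance of coming in at or under the 99th percentile value. With 40 downstream requests, and assuming they are independent, the probability that at least one of them is slower is $(1 - 0.99^{40}) \times 100\%$

Which is about 33%. That’s interesting, isn’t it? This is the fact that we need to worry about. There is roughly a one-in-three chance that our website users experience a response time worse than the 99th percentile when our site has 40 downstream requests.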

Our 50th percentile response time is under 200ms, right? So, what is the probability that a user gets a response time better than that on all 40 downstream requests? It’s $0.5^{40} \times 100\%$

Which is about 0.0000000000909% (roughly $9 \times 10^{-11}\%$). In other words, it’s almost 0%.
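The two calculations can be checked with a short sketch. It assumes the 40 downstream requests are independent of each other, which real systems only approximate, and the helper name is hypothetical.

```python
def chance_all_within(percentile, downstream_requests):
    """Probability that every downstream request is at or under the given
    percentile value, assuming the requests are independent."""
    return (percentile / 100) ** downstream_requests

DOWNSTREAM_REQUESTS = 40

# At least one of the 40 requests is slower than the 99th percentile
worse_than_p99 = 1 - chance_all_within(99, DOWNSTREAM_REQUESTS)

# All 40 requests are faster than the 50th percentile (the median)
all_better_than_p50 = chance_all_within(50, DOWNSTREAM_REQUESTS)

print(f"worse than p99 at least once: {worse_than_p99:.1%}")
print(f"all 40 better than p50:       {all_better_than_p50:.2e}")
```

Changing `DOWNSTREAM_REQUESTS` shows how quickly the tail takes over: the more downstream requests a page makes, the more likely each user is to hit at least one slow one.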

In conclusion, most of our website users experience a response time worse than the 90th percentile value of the response time graph (with 40 downstream requests, that is $1 - 0.9^{40} \approx 98.5\%$ of them). What do you think is the average response time of the above data set? It’s 150.159ms. How wrong are we to assume we are performing well based on the average, when the probability of a user experiencing that response time is almost 0%?

There are situations where the average is the right measure, and there are proper ways to use it, but it should certainly not be the first and only thing you look at. Before relying on an average response time, make sure you do a proper analysis.

Until the next article, keep in mind: “Don’t underestimate the power of the 99th percentile.”

Muditha Perera
Senior Software Quality Assurance Engineer,
Intervest Software Technologies,
LinkedIn: https://www.linkedin.com/in/ulmmperera

