QA Engineering Performance Analysis: Don’t underestimate the power of the 99th percentile

If you work in software development or software quality assurance, one of the things you care about most is performance: the performance of your application. Put simply, you want your application to run smoothly under a realistic load.

To measure this, you are probably using one of the many tools and methods available in the industry. It's true that these tools take your inputs and give you some results. But we still have a problem: analysing that data to get an accurate picture of our system's performance.

Let's assume you decided to test your system's performance using JMeter (if you are not familiar with it, see https://jmeter.apache.org/). You run some tests and get some statistics about your system. If you are a newbie, you will probably look at the average response time, and you might feel confident after seeing the results. The problem is, although you have some positive evidence that the system runs smoothly, do you have the complete picture?

Let's take a small example (note: these are hypothetical data):

Request Count    Response Time
20               100ms
18               110ms
20               112ms
20               115ms
20               200ms
1                10000ms
1                5000ms



According to the above table, the average response time is 275.2ms (27,520ms in total across 100 requests). But you noticed something, didn't you? I guess you spotted the two spikes: 10000ms and 5000ms. Although the average response time looks great, those spikes change everything, because the average tells us half of the story, and sometimes not even half.
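A quick sanity check of both numbers, using only the Python standard library (the data is the hypothetical table above; the percentile uses the simple nearest-rank definition, which other tools may compute slightly differently):

```python
import math
from statistics import mean

# Hypothetical (request_count, response_time_ms) rows from the table above
rows = [(20, 100), (18, 110), (20, 112), (20, 115),
        (20, 200), (1, 10000), (1, 5000)]

# Expand into one entry per request and sort ascending
times = sorted(t for count, t in rows for _ in range(count))

avg = mean(times)                    # 275.2 ms -- looks healthy on its own
rank = math.ceil(0.99 * len(times))  # nearest-rank 99th percentile (1-based)
p99 = times[rank - 1]                # 5000 ms -- a very different story
```

The average and the 99th percentile describe the same hundred requests, yet they differ by a factor of almost twenty.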

DO NOT UNDERESTIMATE THE POWER OF A SPIKE

What could be the possible reason for such a spike? A spike like that is caused by a delay in responding to your request. There are several possible causes, such as network issues, caching issues and so on. Among all of them, the one that matters most to us is latency in the downstream requests.

What are downstream requests?

To answer this question, let's look at what happens when you send a request to https://www.google.lk.
This is what it takes for Google to display the UI that you know:
Host                  Company   Category          Total bytes   Average load   Number of requests
google.lk             Unknown   Unknown           219           218            1
www.google.lk         Unknown   Unknown           267138        1976           10
www.gstatic.com       Google    Advertising       61186         388            2
ssl.gstatic.com       Google    Advertising       7325          343            1
fonts.gstatic.com     Google    CDN Fonts         30780         550            2
apis.google.com       Google    Hosted Libraries  49577         382            1
ogs.google.com        Google    Hosted Libraries  58719         399            1
adservice.google.lk   Unknown   Unknown           0             286            2
adservice.google.com  Google    Hosted Libraries  0             197            2
When you request the Google search page, a couple of dozen requests in total are sent to various hosts before the page gets back to you. Those additional requests are called downstream requests.

What will happen if a downstream request got delayed?

This is where we need to focus. Most of the time, all of these downstream requests and their response times matter, because the GUI we work on cannot be displayed until they complete. For that reason, a delay in a downstream request will affect the total response time of our system.
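As a toy illustration (assuming, for simplicity, that the page cannot render until the slowest downstream response arrives; the numbers are invented):

```python
# Invented response times (ms) for the downstream requests of one page load
downstream_ms = [120, 95, 80, 150, 110, 60, 9000, 130]

# If the page needs every response before it can render, the slowest
# downstream request sets the floor for the total response time.
page_time = max(downstream_ms)    # 9000 ms

# Median of the same requests -- most of them were actually fast
typical = sorted(downstream_ms)[len(downstream_ms) // 2]    # 120 ms
```

One slow downstream request out of eight is enough to make the whole page seventy-five times slower than its typical request.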

Percentile

A percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations falls. For example, the 20th percentile is the value below which 20% of the observations may be found.
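In code, the nearest-rank version of that definition looks like this (a minimal sketch; tools such as JMeter may use slightly different interpolation between ranks):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p% of the observations are less than or equal to it."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

data = list(range(1, 101))  # observations 1..100
# percentile(data, 20) -> 20: 20% of the observations fall at or below it
```

With this helper, the 99th percentile of a response-time sample is just `percentile(times, 99)`.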

An important reason why we need percentiles is to measure the probability that a user experiences a response time worse than some defined threshold. To make it simpler, let's think about a normal website that makes around 40 downstream requests.

We ran a test with 100 requests and recorded the response times, then rearranged them in ascending order; the 99th percentile response time came out at 500ms.


Now the million-dollar question is: what is the probability that a user experiences a response time worse than the 99th percentile value (500ms) on at least one of those downstream requests?

With 40 downstream requests, the probability is (1 − 0.99^40) × 100%.

Which is about 33%. That's interesting, isn't it? This is the fact we need to worry about: there is roughly a 33% chance that a user of our website experiences a response time worse than the 99th percentile when our site makes 40 downstream requests.

Our 50th percentile response time is under 200ms, right? So what is the probability that a user experiences a response time better than that on every downstream request? It's (0.5^40) × 100%.

Which is 0.0000000000909%; in other words, it's practically 0%.
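Both probabilities can be checked in a few lines (a sketch under the simplifying assumption that the 40 downstream response times are independent draws from the same distribution):

```python
n = 40  # downstream requests per page load (independence is a simplification)

# P(at least one of the n requests is slower than its 99th percentile)
p_worse_than_p99 = 1 - 0.99 ** n        # ~0.33, about a one-in-three chance

# P(all n requests are faster than their median, the 50th percentile)
p_all_better_than_p50 = 0.5 ** n        # ~9.1e-13, effectively zero
```

The first number grows quickly with `n`: the more downstream requests a page makes, the more likely at least one of them lands in the slow tail.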

In conclusion, most of our website's users experience a response time worse than the 90th percentile value of the response-time graph (the chance of beating the 90th percentile on all 40 requests is only 0.9^40, about 1.5%). What do you think the average response time of the above data set is? It's 150.159ms. How wrong are we to assume we are performing well based on the average, when the probability of a user consistently experiencing response times that good is almost 0%?

There are situations where you need the average, and there are right ways to use it, but it should never stand alone. Before relying on an average response time, make sure you do a proper analysis.

Until the next article, keep in mind, “Don’t underestimate the power of the 99th percentile”

Muditha Perera
Senior Software Quality Assurance Engineer,
Intervest Software Technologies,
LinkedIn: https://www.linkedin.com/in/ulmmperera

