SingleStore introduced an APPROX_PERCENTILE function to estimate percentiles for a distribution of values. This function is faster and near-exact as compared to the analogous functions that calculate exact percentiles. Read more to learn about this feature and see the performance and accuracy comparisons between approximations and precise values on a sample dataset.
Please check this demo video to learn more about the approx_percentile function in SingleStore:
If you have a sizable numeric dataset that you are analyzing, you likely require the ability to do some standard statistical analysis on it (and maybe more advanced analytics, too!)
Perhaps your dataset represents the frequency of trades for stocks in a portfolio, clicks per user from your mobile application, or the average consumption from smart meters in a given day. For any of these types of datasets, you may need to pinpoint outliers, determine the median, or in general, better understand the distribution of your data (it’s always a normal distribution, right?). One analysis you might do to understand data distribution better is to use functions to calculate percentiles.
Notable examples of these functions may include PERCENTILE_DISC
or PERCENTILE_CONT
(which we have in SingleStore, and they are also industry-standard). PERCENTILE_DISC
calculates the observed percentile, while PERCENTILE_CONT
estimates the quantile depending on the input distribution. If accuracy is more important in this calculation, you should use PERCENTILE_CONT
.
Regardless, using either of these functions to calculate percentiles on a vast dataset can be painfully slow; precise percentile calculations are expensive operations because the data must be ordered before a result can be returned. With large datasets, this ordering operation can use many resources and ultimately is a significant bottleneck.
We recently introduced an APPROX_PERCENTILE
built-in function (documentation) to allow users to efficiently approximate the values at a given percentile for a distribution of values without giving up accuracy in the computation. SingleStore is built for speed and scale, and so our implementation of APPROX_PERCENTILE
leverages all of the SingleStore database’s performance benefits. Now, let’s get into a real-world example of using APPROX_PERCENTILE
and comparing the performance to analogous percentile functions.
The Data and Hardware
We will demonstrate to you the performance benefits and efficiency using APPROX_PERCENTILE
versus PERCENTILE_CONT
.
We use a dataset that consists of metrics measured from operating systems and WebLogic Server monitoring beans provided by the Kaggle Datasets. Here are the characteristics of the data and system we are running this on:
- The dataset contains 60M rows.
- We are running on i3.xlarge Amazon EC2 instances.
- The SingleStoreDB Self-Managed cluster size is 4 Leaf Nodes.
- The cluster has 16 leaf cores total (4 per leaf).
Problem
Let’s say you’re analyzing network data for all your applications. You want to understand the connection delays across applications to pinpoint the ones that need improvement.
Here’s the table definition of the characteristics of your table that’s monitoring the connection info for your applications:
=>CREATE TABLE cluster_management(
host CHAR(16) NOT NULL,
process CHAR(16) NOT NULL,
process_started DATETIME NOT NULL,
is_anomaly BOOL NOT NULL,
connection_delay DOUBLE NOT NULL,
process_cpu DOUBLE NOT NULL,
thread_user_time BIGINT NOT NULL,
thread_cpu_time BIGINT NOT NULL,
KEY(process) USING CLUSTERED COLUMNSTORE,
SHARD KEY(host, process)
);
To identify the slowest processes, say you want to calculate the percentiles of the connection delays in the 95th percentile, grouped by process and host.
First, we run this using PERCENTILE_CONT, which calculates exact percentile distributions. Below is the query output using PERCENTILE_CONT. It takes over 30 seconds to execute.
=>select PERCENTILE_CONT(0.95) WITHIN GROUP(ORDER BY connection_delay) as conn_delay_95, process, host from cluster_management
group by process, host order by host, process;
+--------------------+---------+----------+
| conn_delay_95 | process | host |
+--------------------+---------+----------+
| 60.33306084414562 | wls1 | lphost06 |
| 74.22222222222223 | wls2 | lphost06 |
| 90.1111189787824 | wls1 | lphost07 |
| 97.16666666666667 | wls2 | lphost07 |
| 48.22222222222222 | wls1 | lphost08 |
| 81.22222222222223 | wls2 | lphost08 |
| 504.6111111111111 | wls1 | lphost09 |
| 114 | wls2 | lphost09 |
| 257.94444404835895 | wls1 | lphost10 |
| 398.77777777777777 | wls2 | lphost10 |
| 244.72222222222223 | wls1 | lphost11 |
| 141.49996132984242 | wls2 | lphost11 |
| 158.60986670805664 | wls1 | lphost14 |
| 53.91070471779503 | wls2 | lphost14 |
| 95.72222222222223 | wls1 | lphost15 |
| 330.3888921935945 | wls2 | lphost15 |
| 337.77777777777777 | wls1 | lphost17 |
| 300.2221403142701 | wls2 | lphost17 |
| 323.6111111111111 | wls1 | lphost18 |
| 388.27793939374266 | wls2 | lphost18 |
+--------------------+---------+----------+
20 rows in set (31.82 sec)
The Reveal
Now, we use APPROX_PERCENTILE to get approximate percentiles for the same percentage value. As you can see, it is 10X faster than the above and completes in a little over 3 seconds!
select approx_percentile(connection_delay, 0.95) as conn_delay_95, process, host from cluster_management
group by process, host order by host, process;
+--------------------+---------+----------+
| conn_delay_95 | process | host |
+--------------------+---------+----------+
| 60.331255432672975 | wls1 | lphost06 |
| 74.22222222222223 | wls2 | lphost06 |
| 90.11111898411292 | wls1 | lphost07 |
| 97.16666666666667 | wls2 | lphost07 |
| 48.22222222222222 | wls1 | lphost08 |
| 81.22222222222223 | wls2 | lphost08 |
| 504.58873982961217 | wls1 | lphost09 |
| 114 | wls2 | lphost09 |
| 257.927873719216 | wls1 | lphost10 |
| 398.8204351566121 | wls2 | lphost10 |
| 244.72222578013086 | wls1 | lphost11 |
| 141.49996122540992 | wls2 | lphost11 |
| 158.59711333367832 | wls1 | lphost14 |
| 53.91070683423085 | wls2 | lphost14 |
| 95.72222222222223 | wls1 | lphost15 |
| 330.42222794891194 | wls2 | lphost15 |
| 337.77666235665123 | wls1 | lphost17 |
| 300.2182932586068 | wls2 | lphost17 |
| 323.6093433110946 | wls1 | lphost18 |
| 388.30877260280937 | wls2 | lphost18 |
+--------------------+---------+----------+
20 rows in set (3.37 sec)
The Margin of Error
Let’s compare the difference between the exact calculation from PERCENTILE_CONT and APPROX_PERCENTILE. We gather the differences in the below table.
+--------------------------------------------------+
| abs(actual.conn_delay_95 - approx.conn_delay_95) |
+--------------------------------------------------+
| 0.000000005330520025381702 |
| 0.00111542112654206 |
| 0.012753374378320359 |
| 0.0017678000164664809 |
| 0 |
| 0.02237128149891987 |
| 0 |
| 0.030833209066713607 |
| 0.0000021164358230407743 |
| 0.033335755317466464 |
| 0 |
| 0.0018054114726453463 |
| 0 |
| 0.04265737883434895 |
| 0.00000010443250175740104 |
| 0.016570329142950868 |
| 0 |
| 0.003847055663300125 |
| 0.0000035579086272718996 |
| 0 |
+--------------------------------------------------+
The average difference between all the values in the distribution is 0.008. This represents a percent error of about 0.002%, which is very accurate!
+-------------------------------------------------------+
| avg(abs(actual.conn_delay_95 - approx.conn_delay_95)) |
+-------------------------------------------------------+
| 0.008353140031257311 |
+-------------------------------------------------------+
Overall, you can use APPROX_PERCENTILE to speed up your percentile calculations without losing a lot of accuracy.
More on APPROX_PERCENTILE
Accuracy and Error
Note that for some workloads, it is imperative to calculate exact percentiles. If that is the case, we would recommend continuing to use functions like PERCENTILE_CONT for this computation. You should use APPROX_PERCENTILE if your accuracy can accept some error tolerance.
Our example above shows that our error is, on average, roughly 0.002 percent off from the exact numbers in our example calculations, which is pretty close. Note that the margin of error will differ across datasets.
Specifically, the error threshold of APPROX_PERCENTILE depends directly on the size of data and the requested quantile. In general, we expect the approximate estimation to differ from the exact estimate within a limited range of the number itself [-0.01,0.01] for a dataset of more than 100k rows. For example, if you have a uniform distribution from 0 to the 200, then the 50th percentile will be 100, 49th – 98, 51th – 102 percentiles respectively. In other words, when you run APPROX_PERCENTILE(column_x, 0.5) on this distribution, you should expect that the return value will be in the range between 98 to 102 from the actual percentile (100). Additionally, accuracy improves with more massive datasets and also as the percentage approaches the lower or upper bounds of the distribution (i.e., 1st and 99th percentile).
Accuracy Argument
By default, in SingleStore, the accuracy parameter for APPROX_PERCENTILE is 0.01. You can also vary this parameter from 0.01 to 0.5. Note that the larger this number, the less accurate the calculation is. However, reducing accuracy will improve performance. If the function expects lower accuracy, we store much fewer data points during processing, leading to a decrease in query execution time. We recommend you keep the default unless you require better performance.
T-Digest Implementation
SingleStore uses the remarkable T-digest algorithm to calculate APPROX_PERCENTILEs. T-digest is generated by clustering real-valued samples and retaining the mean and number of samples for each cluster, also known as a centroid. T-digest keeps a buffer of incoming samples. When the buffer fills, the contents are sorted and merged with the centroids computed from previous samples.
The size bound for centroids is imposed using a scale function that forces clusters near the beginning or end of the digest to be small, possibly containing only a single sample. The scale function is chosen to provide an appropriate trade-off between a very accurate quantile estimate in the tails of a distribution, reasonable accuracy near the median while keeping the number of clusters as small as possible for performance.
Conclusion
APPROX_PERCENTILE allows us to do fast and accurate percentile calculations. The T-digest implementation in this function sorts only a small amount of data at a given time, whereas analogous percentile functions like PERCENTILE_CONT order all data on every partition, which leads to slower performance overall. You may leverage the APPROX_PERCENTILE function for more efficient calculations of percentiles across your dataset, mostly if you are operating on massive datasets, and you can accept a small margin of error for this calculation.
Download SingleStoreDB Self-Managed here and try APPROX_PERCENTILE today!