A month ago we found that joining on integer / binary columns is significantly faster than char(32):

> The `binary` data type was 2.3 times faster than `char`. The `bigint` data type was 1.5 times faster than `binary` and 3.5 times faster than `char`.

(From: What is best practice for CLUSTERED COLUMNSTORE keys - #6 by hanson - Query Tuning - SingleStore Forums)
Today, we found that this is also the case when filtering data, even with full index coverage, but only on rowstore tables; with columnstore tables we observed the opposite.
I’ll give our example to show the performance improvement:
We have a table with keywords. Each keyword has a location (geotarget). The original table used `varchar` to store a two-letter ISO country code, e.g. DK:
CREATE TABLE `keywords-varchar` (
`id` binary(16) NOT NULL,
`location` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
KEY `location` (`location`)
)
We decided to test with location IDs consisting of four-digit numbers, e.g. 1000:
CREATE TABLE `keywords-int` (
`id` binary(16) NOT NULL,
`location` int(10) unsigned DEFAULT null,
PRIMARY KEY (`id`),
KEY `location` (`location`)
)
We used a stored procedure to measure execution time while excluding client communication (inspired by @hanson's blog post).
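For reference, here is a minimal sketch of such a timing procedure in MySQL-flavored syntax. This is not the actual `measure_q` from the blog post; the procedure name, parameters, and the microsecond output column are assumptions, and SingleStore's own procedural SQL dialect differs:

```sql
-- Hypothetical measure_q: run a query string `iters` times and report
-- the total elapsed time. The query is prepared once so parsing cost
-- is paid only once; only execution is inside the timed loop.
DELIMITER //
CREATE PROCEDURE measure_q(IN q TEXT, IN iters INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  DECLARE started DATETIME(6);
  SET @stmt = q;
  PREPARE ps FROM @stmt;
  SET started = NOW(6);
  WHILE i < iters DO
    EXECUTE ps;
    SET i = i + 1;
  END WHILE;
  -- total elapsed microseconds across all iterations
  SELECT TIMESTAMPDIFF(MICROSECOND, started, NOW(6)) AS total_us;
  DEALLOCATE PREPARE ps;
END //
DELIMITER ;
```

For single-value queries like the `count(*)` examples below, appending `INTO @sink` to the query string would also suppress result sets being sent to the client during the loop.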
Results based on ~20M rows (all with the same `location`):
call measure_q("select count(*) from `keywords-int` where location = 1000", 100);
155,289
call measure_q("select count(*) from `keywords-varchar` where location = 'DK'", 100);
482,675
Integers are 3 times faster!
The above queries are just there to illustrate the performance difference. With actual real-application queries we're seeing 10x faster execution.
Funny thing: we tried `mediumint(8)` as well, to reduce memory usage, but it performs nowhere near as well as `int(10)`. Not sure if we're missing something:
CREATE TABLE `keywords-mediumint` (
`id` binary(16) NOT NULL,
`location` mediumint(8) unsigned DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `location` (`location`)
)
call measure_q("select count(*) from `keywords-mediumint` where location = 2208", 100);
389,273
Disclaimer: in this example, every row had the same location value (a cardinality of 1). The results might be different with higher cardinality. I'll provide an update when we migrate our production environment.
However, we tried the same with our columnstore table (50M rows) and saw execution times 10-20 ms slower with integers. Apparently columnstore must have some varchar optimization in its compression, but based on the rowstore results I wouldn't have expected columnstore to be slower when using integers.
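One way to sanity-check that theory would be to look at the encodings the columnstore actually picked per column. This assumes your SingleStore version exposes segment metadata in `information_schema.COLUMNAR_SEGMENTS`; the table and column names here are from memory, so verify them against the docs for your version:

```sql
-- Hypothetical check: a dictionary or run-length encoding on the varchar
-- location column (which here has a single distinct value) would explain
-- why the string filter is competitive on columnstore.
SELECT column_name, encoding, COUNT(*) AS segments
FROM information_schema.columnar_segments
WHERE table_name = 'keywords'   -- hypothetical columnstore table name
GROUP BY column_name, encoding;
```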