Is there any way to optimize row sorting in MemSQL?
For example, I have a 600M+ row table of customer transactional data and the first column row is an auto_increment integer column. I’m able to run a query that filters by customer information, which drops my overall result to 18M+ rows. However, if I add “ORDER BY row DESC” the performance drastically drops.
I tested an alternate scenario by creating a new table with only the row column and those 18M+ rows, the absolute minimum dataset I’d work on at a time, followed by “ORDER BY row DESC LIMIT 25” and performance still suffers.
The end goal is to display the latest 25 entries filtered by customer, hence applying that “ORDER BY row DESC LIMIT 25” query, yet working with anywhere from 600M to over 1B rows of data in a 1.5TB database. The query execution time is at minimum 12 seconds, not at all optimal for customers accessing the dataset through a frontend / website.
I’ve created two tables isolating those 3 customer data columns of interest. Both tables each have a total of 648,260,111 rows.
CREATE TABLE IF NOT EXISTS tbl_data_columnstore_shards_no (row BIGINT, addrFrom CHAR(42), addrTo CHAR(42), KEY(row) USING CLUSTERED COLUMNSTORE);
CREATE TABLE IF NOT EXISTS tbl_data_columnstore_shards_yes (row BIGINT, addrFrom CHAR(42), addrTo CHAR(42), KEY(row) USING CLUSTERED COLUMNSTORE, SHARD KEY (addrFrom, addrTo));
The addrFrom and addrTo columns are filled with customer info in hexadecimal format. An example cell: 0xFBb1b73C4f0BDa4f67dcA266ce6Ef42f520fBB98
Here is the query for the 1st table:
SELECT row FROM tbl_data_columnstore_shards_no WHERE (addrFrom = ‘0xFBb1b73C4f0BDa4f67dcA266ce6Ef42f520fBB98’ OR addrTo = ‘0xFBb1b73C4f0BDa4f67dcA266ce6Ef42f520fBB98’) ORDER BY row DESC LIMIT 25 OFFSET 0;
Here is the query for the 2nd table:
SELECT row FROM tbl_data_columnstore_shards_yes WHERE (addrFrom = ‘0xFBb1b73C4f0BDa4f67dcA266ce6Ef42f520fBB98’ OR addrTo = ‘0xFBb1b73C4f0BDa4f67dcA266ce6Ef42f520fBB98’) ORDER BY row DESC LIMIT 25 OFFSET 0;
Below are links to the pastebins of both JSON files.
MemSQL doesn’t yet support a reverse scan of the sort key. So, once you add the order by we need to scan the entire result and do a top sort (instead of just scanning in reverse order and stopping when we find the top X). You’ll notice in the fast case without the order by the scan stopped much sooner (and all the extra cost in the profile is form needing to scan the entire table).
If your running at least memsql 7.x you can create your sort key with desc sort order.
CREATE TABLE IF NOT EXISTS tbl_data_columnstore_shards_no (row BIGINT, addrFrom CHAR(42), addrTo CHAR(42), KEY( rowdesc) USING CLUSTERED COLUMNSTORE);
This way you’ll notice the top sort goes away in the profile (a limit is pushed down directly over top of the scan so you’ll get the same performance as without the order by).