Optimizing Performance of "ORDER BY 'X' DESC" on Large Tables

ymonye · August 31, 2020, 2:59am

Hi all,

Is there any way to optimize row sorting in MemSQL?

For example, I have a 600M+ row table of customer transactional data and the first column row is an auto_increment integer column. I’m able to run a query that filters by customer information, which drops my overall result to 18M+ rows. However, if I add “ORDER BY row DESC” the performance drastically drops.

I tested an alternate scenario by creating a new table with only the row column and those 18M+ rows, the absolute minimum dataset I’d work on at a time, followed by “ORDER BY row DESC LIMIT 25” and performance still suffers.

The end goal is to display the latest 25 entries filtered by customer, hence applying that “ORDER BY row DESC LIMIT 25” query, yet working with anywhere from 600M to over 1B rows of data in a 1.5TB database. The query execution time is at minimum 12 seconds, not at all optimal for customers accessing the dataset through a frontend / website.

adam · August 31, 2020, 6:14pm

Hi,

We’re going to need some more details to help debug it. We likely need to see the query itself and a what profile of your query.

You can also use studio to visualize it:

I wouldn’t think sorting 17 million rows could be that expensive if your only returning the top 25.

-Adam

ymonye · August 31, 2020, 9:45pm

Hi adam,

I’ve created two tables isolating those 3 customer data columns of interest. Both tables each have a total of 648,260,111 rows.

CREATE TABLE IF NOT EXISTS tbl_data_columnstore_shards_no (row BIGINT, addrFrom CHAR(42), addrTo CHAR(42), KEY(row) USING CLUSTERED COLUMNSTORE);

CREATE TABLE IF NOT EXISTS tbl_data_columnstore_shards_yes (row BIGINT, addrFrom CHAR(42), addrTo CHAR(42), KEY(row) USING CLUSTERED COLUMNSTORE, SHARD KEY (addrFrom, addrTo));

The addrFrom and addrTo columns are filled with customer info in hexadecimal format. An example cell: 0xFBb1b73C4f0BDa4f67dcA266ce6Ef42f520fBB98

Here is the query for the 1st table:

SELECT row FROM tbl_data_columnstore_shards_no WHERE (addrFrom = ‘0xFBb1b73C4f0BDa4f67dcA266ce6Ef42f520fBB98’ OR addrTo = ‘0xFBb1b73C4f0BDa4f67dcA266ce6Ef42f520fBB98’) ORDER BY row DESC LIMIT 25 OFFSET 0;

Here is the query for the 2nd table:

SELECT row FROM tbl_data_columnstore_shards_yes WHERE (addrFrom = ‘0xFBb1b73C4f0BDa4f67dcA266ce6Ef42f520fBB98’ OR addrTo = ‘0xFBb1b73C4f0BDa4f67dcA266ce6Ef42f520fBB98’) ORDER BY row DESC LIMIT 25 OFFSET 0;

Below are links to the pastebins of both JSON files.

tbl_data_columnstore_shards_no (18 seconds to run):
https://pastebin.com/yAZeZex0

tbl_data_columnstore_shards_yes (18 seconds to run):
https://pastebin.com/aeVFbxFZ

Below are screenshots to the profiles of both queries.

tbl_data_columnstore_shards_no (18 seconds to run):

tbl_data_columnstore_shards_yes (18 seconds to run):

ymonye · August 31, 2020, 10:09pm

I re-ran both queries without the “ORDER by row DESC” argument to demonstrate the fact that sorting is killing the overall query performance.

Here is the query for the 1st table:

SELECT row FROM tbl_data_columnstore_shards_no WHERE (addrFrom = ‘0xFBb1b73C4f0BDa4f67dcA266ce6Ef42f520fBB98’ OR addrTo = ‘0xFBb1b73C4f0BDa4f67dcA266ce6Ef42f520fBB98’) LIMIT 25 OFFSET 0;

Here is the query for the 2nd table:

SELECT row FROM tbl_data_columnstore_shards_yes WHERE (addrFrom = ‘0xFBb1b73C4f0BDa4f67dcA266ce6Ef42f520fBB98’ OR addrTo = ‘0xFBb1b73C4f0BDa4f67dcA266ce6Ef42f520fBB98’) LIMIT 25 OFFSET 0;

Below are links to the pastebins of both JSON files.

tbl_data_columnstore_shards_no (435 ms to run):
https://pastebin.com/n0JAUqJy

tbl_data_columnstore_shards_yes (400 ms seconds to run):
https://pastebin.com/hRbdKVgJ

Below are screenshots to the profiles of both queries.

tbl_data_columnstore_shards_no (435 ms to run):

tbl_data_columnstore_shards_yes (400 ms to run):

adam · August 31, 2020, 10:33pm

Thanks,

MemSQL doesn’t yet support a reverse scan of the sort key. So, once you add the order by we need to scan the entire result and do a top sort (instead of just scanning in reverse order and stopping when we find the top X). You’ll notice in the fast case without the order by the scan stopped much sooner (and all the extra cost in the profile is form needing to scan the entire table).

If your running at least memsql 7.x you can create your sort key with desc sort order.

CREATE TABLE IF NOT EXISTS tbl_data_columnstore_shards_no (row BIGINT, addrFrom CHAR(42), addrTo CHAR(42), KEY( row desc) USING CLUSTERED COLUMNSTORE);

This way you’ll notice the top sort goes away in the profile (a limit is pushed down directly over top of the scan so you’ll get the same performance as without the order by).

ymonye · September 1, 2020, 10:45am

Wow, thanks a ton!

Our longest query was dropped from ~18 seconds to 1.638 seconds (according to our profiler), just from that sort modification.