Greetings SingleStore Team,
We have been using the columnstore as a demo repository for financial market tick data time series.
For example, our trade event schema looks like this:
where the PRIMARY KEY sort is on the SecurityID (integer version of a Ticker like AAPL, GOOG, etc), then DateTimeKey and SequenceNumber (to disambiguate events that happen at the same millisecond).
When we do a FULL OPTIMIZE on this table we get rowsegments that get sorted like this (data from information_schema.COLUMNAR_SEGMENTS for SecurityID & DateTimeKey columns):
What I would like is for each SecurityID to ideally not share a rowsegment with other SecurityIDs. Put another way, I would like the ability to break rowsegments by the first key(s) of the table.
If I query for SecurityID=62418 AND DateTimeKey BETWEEN ‘2019-08-03’ AND ‘2019-08-10’ the query engine will look in three row segments: 2320465, 2320469 and 2320477. This is because the MaxDateTimeKey of SecurityID 21536 is 2020-01-10, the MinDateTimeKey of SecurityID 68206 is 2016-05-10 and these neighboring securityids share a rowsegment with SecurityID 62418.
This query should ideally only open one rowsegment: 2320469.
There are several additional expected benefits to having the option for rowsegments to break on the first key(s) of a table.
In this use case, the time, price and quantity features will be in the same domain by securityid so rowsegments will see improved compression rates when not mixing SecurityIDs.
Additionally, when running a FULL OPTIMIZE there will be substantially less work to do. If in this example we insert some rows for SecurityID=21536, ALL of the rowsegments after SEGMENT_ID 2320465 will be rewritten in a FULL OPTIMIZE as the evicted rows from the first rowsegment push into 2320467 and so on such that the entire columnstore is eventually re-written. If rowsegments only contained data for one securityid only rowsegments containing the new merging-in securityids would be impacted. I could see this reducing the runtime of a FULL OPTIMIZE on our 4+ TB dataset from many hours to minutes. Also, less SSD wear.
I am not sure where the most appropriate syntactical place would be for adding this kind of feature. Whether it belongs in the CREATE TABLE syntax or in the OPTIMIZE TABLE syntax. The former being a more holistic solution and the latter perhaps more expedient.
Thanks for the consideration and happy to discuss / clarify!
-Rich