Feature Request: Rowsegment Break Option

Greetings SingleStore Team,

We have been using the columnstore as a demo repository for financial market tick data time series.

For example, our trade event schema looks like this:
[trade event schema screenshot]

where the PRIMARY KEY sort is on SecurityID (an integer version of a ticker like AAPL, GOOG, etc.), then DateTimeKey, then SequenceNumber (to disambiguate events that happen at the same millisecond).

When we do a FULL OPTIMIZE on this table, the rowsegments end up sorted on (SecurityID, DateTimeKey), with segment boundaries landing wherever the row limit happens to fall, so neighboring SecurityIDs share rowsegments. We can see this in information_schema.COLUMNAR_SEGMENTS for the SecurityID and DateTimeKey columns.
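
For reference, this is roughly how I pull that per-segment metadata (a sketch; I am going from memory on the exact column names like MIN_VALUE/MAX_VALUE, so they may differ by version, and the table name here is just a placeholder for our trade table):

SELECT SEGMENT_ID, COLUMN_NAME, MIN_VALUE, MAX_VALUE
FROM information_schema.COLUMNAR_SEGMENTS
WHERE TABLE_NAME = 'trades'                          -- placeholder name
  AND COLUMN_NAME IN ('SecurityID', 'DateTimeKey')
ORDER BY SEGMENT_ID, COLUMN_NAME;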

What I would like is for each SecurityID to ideally not share a rowsegment with other SecurityIDs. Put another way, I would like the ability to break rowsegments by the first key(s) of the table.

If I query for SecurityID=62418 AND DateTimeKey BETWEEN '2019-08-03' AND '2019-08-10', the query engine will look in three rowsegments: 2320465, 2320469, and 2320477. This is because the MaxDateTimeKey of SecurityID 21536 is 2020-01-10, the MinDateTimeKey of SecurityID 68206 is 2016-05-10, and these neighboring SecurityIDs share rowsegments with SecurityID 62418.
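
Written out (again with a placeholder table name for our trade table), the query is simply:

SELECT *
FROM trades
WHERE SecurityID = 62418
  AND DateTimeKey BETWEEN '2019-08-03' AND '2019-08-10';
-- Today this touches rowsegments 2320465, 2320469 and 2320477.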

This query should ideally only open one rowsegment: 2320469.

There are several additional expected benefits to having the option for rowsegments to break on the first key(s) of a table.

In this use case, the time, price, and quantity values stay within a narrower domain for any single SecurityID, so rowsegments should see improved compression rates when SecurityIDs are not mixed.

Additionally, a FULL OPTIMIZE would have substantially less work to do. In this example, if we insert some rows for SecurityID=21536, ALL of the rowsegments after SEGMENT_ID 2320465 get rewritten during a FULL OPTIMIZE: the evicted rows from the first rowsegment push into 2320467, and so on, until the entire columnstore has eventually been rewritten. If each rowsegment contained data for only one SecurityID, only the rowsegments containing the SecurityIDs being merged in would be affected. I could see this reducing the runtime of a FULL OPTIMIZE on our 4+ TB dataset from many hours to minutes, with less SSD wear as a bonus.

I am not sure where the most appropriate syntactic home for this kind of feature would be: it could live in the CREATE TABLE syntax or in the OPTIMIZE TABLE syntax. The former would be the more holistic solution, the latter perhaps the more expedient one.
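
Just to make the idea concrete, it could look something like this. To be clear, the SEGMENT BREAK / BREAK SEGMENTS clauses are syntax I am inventing purely for illustration, and the column names are placeholders for our trade table:

CREATE TABLE trades (
	SecurityID int NOT NULL,
	DateTimeKey datetime(6) NOT NULL,
	SequenceNumber smallint NOT NULL,
	Price double NOT NULL,
	Quantity int NOT NULL,
	SORT KEY (SecurityID, DateTimeKey, SequenceNumber),
	SHARD KEY (SecurityID)
	-- , SEGMENT BREAK ON (SecurityID)   <- invented clause, not real syntax
);

-- or, as an optimize-time option (also invented syntax):
-- OPTIMIZE TABLE trades FULL BREAK SEGMENTS ON (SecurityID);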

Thanks for the consideration and happy to discuss / clarify!

-Rich

Rich, thanks for your well-written request!

What are your sort key and shard key? Please do SHOW CREATE TABLE for your event table and put the result here.

We’re considering a few things that could help here, such as sub-partition elimination.

We’ve seen requests for range and list partitioning which could help address this kind of thing, but also come with their own complexities and potential for misuse.

In general it is bad to have segments that are too small because you end up with too many files (blobs) in the storage layer, and you need metadata to reference each blob. So list partitioning by securityID could cause that kind of problem if there are too many securityID values. How many securityID values do you have in your table?

It may be possible to get queries to eliminate more rows by reducing segment size, having more partitions, and choosing the best possible sort and shard keys. But having more partitions and segments comes with overhead, as I mentioned.
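
If you want to experiment with smaller segments, the knob to look at is the columnstore_segment_rows engine variable (I am going from memory on the name and default, which is around one million rows, so please double-check the docs for your version):

SHOW GLOBAL VARIABLES LIKE 'columnstore_segment_rows';
-- Lowering it (e.g. SET GLOBAL columnstore_segment_rows = 102400;) makes newly written
-- segments smaller, which can improve elimination but adds the blob and metadata
-- overhead described above.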

For our reference, this relates to this internal feature request: PM-1431

CREATE TABLE `quotes_rt` ( 
	`SecurityID` int(11) NOT NULL, 
	`DateTimeKey` datetime(6) NOT NULL, 
	`SequenceNumber` smallint(6) NOT NULL DEFAULT -32768, 
	`BidPrice` double NOT NULL, 
	`BidQuantity` int(11) NOT NULL, 
	`BidConditionCode` varchar(4) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL, 
	`BidSource` varchar(4) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL, 
	`AskPrice` double NOT NULL, 
	`AskQuantity` int(11) NOT NULL, 
	`AskConditionCode` varchar(4) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL, 
	`AskSource` varchar(4) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL, 
	SORT KEY `SecurityID` (`SecurityID`,`DateTimeKey`,`SequenceNumber`), 
	SHARD KEY `SecurityID_2` (`SecurityID`) 
) 
AUTOSTATS_CARDINALITY_MODE=INCREMENTAL 
AUTOSTATS_HISTOGRAM_MODE=CREATE 
AUTOSTATS_SAMPLING=ON 
SQL_MODE='STRICT_ALL_TABLES,NO_AUTO_CREATE_USER'


This is the quotes table, containing 249.7 billion rows. It is sorted by SecurityID, then DateTimeKey, and a SequenceNumber to disambiguate events that occur at the same time. SecurityIDs have sharded pretty evenly across the nodes, which is nice.

With 1mm-row rowsegments, we have roughly 249,725 rowsegments in the perfectly sorted version of this table.

The table has 10,500 unique SecurityIDs. About 7000 securities have >= 1mm quote events (249.1bb quotes for this group) and 3500 have < 1mm quote events (606mm quotes in this group).

If we broke rowsegments on SecurityID, there would be about 256,161 rowsegments: a 2.8% increase in rowsegment count. (To get this number, I took the sum of CEILING(rowcount_by_security_id / 1e6) across all SecurityIDs.)
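
Roughly the query behind that estimate (a sketch; it just re-derives the CEILING sum described above):

SELECT SUM(CEILING(cnt / 1e6)) AS estimated_rowsegment_count
FROM (
  SELECT SecurityID, COUNT(*) AS cnt
  FROM quotes_rt
  GROUP BY SecurityID
) per_security;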

So in terms of the tradeoff here: is the cost of scanning an extra 2.8% of rowsegment metadata worth the savings of not unnecessarily decompressing and filtering an extra two rowsegments (2mm rows of data) on every query? I don't know the fine details of how rowsegment elimination works, but checking 7000 extra rowsegment metadata entries versus decompressing 2mm extra rows seems like the cheaper side of the trade. Does that make sense? What do you think?

Kind regards.

There is a data-loading element to this as well, and I do not know how the tuple mover would want to handle it. Specifically, our ingest is a real-time stream with lots of SecurityIDs coming in simultaneously.

Is it expensive to be simultaneously tuple moving into 5000 open/appending rowsegments?

Perhaps rowsegment breaking should only happen through an OPTIMIZE command and not be strictly/uniformly enforced at the CREATE TABLE schema level?

Thanks for sharing these details, Rich. It's hard to tell if list partitioning would help a lot in this situation – it might help a little. However, we are investigating sub-segment elimination, which would keep min/max metadata at a finer-than-segment level. That could help in your situation. I'll pass your use case along to the developers.

Partitioning does help our use case, but primarily on the OPTIMIZE TABLE FULL side.

Partitioning is what we do with MSSQL's columnstore. We have partitions based on a date range, for instance monthly or quarterly. Once the month or quarter is complete, we perfectly sort and re-insert the data by SecurityID & DateTimeKey to get a static "golden master" set of rowsegments that should never need to be rewritten, because we do not expect new trades or quotes to come in for that month. We also carefully manage the batch loading process so that the delta store breaks on SecurityID where possible, to prevent more than one SecurityID from residing in any rowsegment.

The way partitioning is helpful is that generally, once a time range of data is “closed” it never needs to be re-sorted / optimized again.

With SingleStore currently, if I insert just one new row for the smallest SecurityID, OPTIMIZE TABLE FULL will properly place that row where it belongs, somewhere in the first few rowsegments. Inserting that row into an already-full rowsegment then evicts one row into the next rowsegment, which evicts one row into the one after that, and so on. The result is that optimizing after inserting just one row requires decompressing and recompressing the ENTIRE table.

This seems highly unnecessary and inefficient. What I am proposing (in the other thread) is that when OPTIMIZE FULL inserts that one row, it simply keeps the now 1,000,001-row rowsegment without evicting any rows. If it does that, there is nothing else for the optimizer to do. Having some flexibility around rowsegment length could therefore make full optimizations much, much faster. A tolerance of +/-30% of the default/optimal rowsegment length seems reasonable to me: merging 300k new rows into a 1mm-row rowsegment would be OK (one 1.3mm-row segment), but 350k would exceed the tolerance, so the merge would instead produce two new rowsegments of 675k rows each (675k + 675k = 1.35mm). The goal is to avoid merge "spillover" that causes more rowsegments to be touched unnecessarily. Since so much of the data in the columnstore rowsegments is already sorted, I imagine the background optimization process would use substantially fewer cycles with this kind of approach.

The tolerance % could just be a parameter; the current merger algorithm effectively has a tolerance of 0%.
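
To make the rule concrete with the numbers above (this is just a sketch of the decision I am proposing, with a hypothetical 30% tolerance, not how the current merger works):

SELECT
  CASE WHEN 1000000 + 300000 <= 1000000 * (1 + 0.30)
       THEN 'keep one 1.3mm-row rowsegment'
       ELSE 'split into two rowsegments' END AS merge_300k,
  CASE WHEN 1000000 + 350000 <= 1000000 * (1 + 0.30)
       THEN 'keep one 1.35mm-row rowsegment'
       ELSE 'split into two ~675k-row rowsegments' END AS merge_350k;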

-Rich

Richard, thanks for describing this in more detail. I see the merits of what you are suggesting. I opened an internal feature request for this. Our engineer familiar with this recommends running OPTIMIZE TABLE without FULL in your situation.

In your environment, what are your hardware and data size, how long does OPTIMIZE TABLE FULL take, and what are the ramifications of that for you?