Hello!
We have been using the SingleStore managed service for a year now with great success. We want to start making heavier use of the platform, but our team is running into difficulties defining the schema of a billion-row table so that it can support different types of queries with sub-second performance.
The table would contain daily-aggregated data for millions of users, with the following columns:

```
user_id VARCHAR
day DATE
agg_field_1
agg_field_2
…
agg_field_50
```
Every day we have data for a few million users (from 1 to 5 million), with around 50 different aggregate values, which are integers, booleans and short strings. The cardinality of the `user_id` column over an entire year would be around 50 million.
We have 10 years of historical data (around 10 billion rows), and the table should support growing for another 10 years.
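For concreteness, here is a simplified sketch of the DDL. The table name and the `agg_field` types are placeholders (in reality the 50 fields are a mix of integers, booleans and short strings), and I have left out the sort/shard keys since those are the crux of the question:

```sql
CREATE TABLE daily_user_aggregates (
    user_id     VARCHAR(64) NOT NULL,
    day         DATE NOT NULL,
    agg_field_1 BIGINT,       -- integer aggregate
    agg_field_2 TINYINT(1),   -- boolean flag
    agg_field_3 VARCHAR(32)   -- short string
    -- agg_field_4 ... agg_field_50 omitted for brevity
);
```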
The complexity lies in the different types of queries we need to perform (simplified examples follow this list):

- For a given `day`:
  - Aggregate the data of all the 1-5M users of that day, performing `SUM`, `MAX`, `AVG` and other aggregations on the `agg_fields` listed above.
  - List the top users of that day, sorting by one of the `agg_fields`.
- For a given list of `user_id` (from 1 to 500) and a range of days:
  - Aggregate the data for each user, performing `SUM`, `MAX`, `AVG` and other aggregations on the `agg_fields` listed above.
  - Get all the `agg_fields` for all the days and all the users.
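To make the patterns concrete, simplified versions of the queries look like this (same placeholder table and column names as in the sketch above; the literal dates and ids are just examples):

```sql
-- Day-filtered: aggregate across all users of a single day
SELECT SUM(agg_field_1), MAX(agg_field_1), AVG(agg_field_1)
FROM daily_user_aggregates
WHERE day = '2023-06-01';

-- Day-filtered: top users of a single day by one aggregate
SELECT user_id, agg_field_1
FROM daily_user_aggregates
WHERE day = '2023-06-01'
ORDER BY agg_field_1 DESC
LIMIT 100;

-- User-filtered: per-user aggregates over a range of days
SELECT user_id, SUM(agg_field_1), MAX(agg_field_1), AVG(agg_field_1)
FROM daily_user_aggregates
WHERE user_id IN ('user_a', 'user_b', 'user_c')  -- up to ~500 ids
  AND day BETWEEN '2023-01-01' AND '2023-06-30'
GROUP BY user_id;

-- User-filtered: all agg_fields for all days of those users
SELECT *
FROM daily_user_aggregates
WHERE user_id IN ('user_a', 'user_b', 'user_c')
  AND day BETWEEN '2023-01-01' AND '2023-06-30';
```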
If we use `day` as the Columnstore Key and Shard Key we only get good performance on the queries that filter by `day`, and if we use `user_id` as the Columnstore Key and Shard Key we only get good performance on the queries that filter by `user_id`, but we haven't been able to get good performance on all our queries.
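In other words, these are the two single-table layouts we have tried, using the `SORT KEY` syntax for the columnstore key (abbreviated DDL, same placeholder names as above):

```sql
-- Layout A: fast for the day-filtered queries, slow for user_id lookups
CREATE TABLE daily_user_aggregates_by_day (
    user_id     VARCHAR(64) NOT NULL,
    day         DATE NOT NULL,
    agg_field_1 BIGINT,
    -- remaining agg_fields omitted
    SORT KEY (day),
    SHARD KEY (day)
);

-- Layout B: fast for the user_id lookups, slow for day-filtered scans
CREATE TABLE daily_user_aggregates_by_user (
    user_id     VARCHAR(64) NOT NULL,
    day         DATE NOT NULL,
    agg_field_1 BIGINT,
    -- remaining agg_fields omitted
    SORT KEY (user_id),
    SHARD KEY (user_id)
);
```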
Is the only solution to have two tables containing the same data but with different Columnstore/Shard keys, or is there a better solution?
Thanks a lot!