Hi -
I have a use case where I am running a GROUP BY query over a large number of HLLs, returning anywhere from 1M to 10M rows (each row containing a combined HLL – the raw 16KB, not the cardinality). Since each HLL is stored in a dense 16KB layout, this consumes a significant amount of memory during the query: at 16KB per output row, 3M rows already account for roughly 48GB, and in practice a query that generates 3M rows requires about 50GB of memory to complete.
While support for HLLs with a sparse layout would resolve this problem for us (our HLLs often have low cardinality, so the query would consume < 100MB), it’s not yet supported in MemSQL as far as I can tell.
Are there any options here to either reduce memory usage or have the query spill to disk for the HashGroupBy operation? I could theoretically split the query into multiple smaller queries (see the sketch after the query below), but I was hoping to avoid that.
Effectively, the query looks like this:
select approx_count_distinct_combine(approx_value)
from facts
where account_id = ... and metric_id = ... and approx_value is not null
group by ...;
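For reference, the kind of split I had in mind would look roughly like this (just a sketch; dimension_id is a stand-in for the real GROUP BY column, and I’m assuming it’s numeric so it can be bucketed with a modulo):

select dimension_id, approx_count_distinct_combine(approx_value)
from facts
where account_id = ... and metric_id = ... and approx_value is not null
  and dimension_id % 4 = 0  -- run again with 1, 2, 3 to cover the remaining buckets
group by dimension_id;

Each pass would only materialize about a quarter of the combined HLLs at a time, but it means several round trips and stitching the results together on the client side, which is what I was hoping to avoid.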
The output of the profile is:
| Gather partitions:all alias:remote_0 actual_rows: 3,092,050 exec_time: 0ms start_time: 00:00:00.001 end_time: 00:01:13.058 |
| Project [`approx_count_distinct_combine(approx_value)`] actual_rows: 3,092,050 exec_time: 147ms start_time: [00:00:00.981, 00:00:01.780] network_traffic: 50,669,424.000000 KB network_time: 71,506ms |
| HashGroupBy [APPROX_COUNT_DISTINCT_COMBINE(facts.approx_value) AS `approx_count_distinct_combine(approx_value)`] groups:[...] actual_rows: 3,092,050 exec_time: 415ms start_time: [00:00:00.375, 00:00:00.943] memory_usage: 434,111.218750 KB |
| Filter [facts.account_id = ? AND facts.metric_id = ? AND facts.approx_value IS NOT NULL] actual_rows: 3,092,050 exec_time: 1ms start_time: [00:00:00.361, 00:00:00.443] |
| ColumnStoreScan report_service.facts, KEY facts_sort_key (...) USING CLUSTERED COLUMNSTORE WITH(COLUMNSTORE_SEGMENT_ROWS=102400) actual_rows: 15,604,684 exec_time: 782ms start_time: [00:00:00.353, 00:00:00.425] memory_usage: 125,829.117188 KB segments_scanned: 1,221 segments_skipped: 9,164 segments_fully_contained: 0 |
I’ve seen this addressed in other databases either by forcing the query engine to spill the HashGroupBy to disk or by using a sparse HLL layout. Would love to hear your thoughts.
Thanks!