Hi -
I’m running into a performance issue with a query on one of my columnstore tables. The crux of the issue seems to be how I look up the set of foreign keys that I filter the columnstore table by. This is what my table looks like:
CREATE TABLE `facts_offer_metrics` (
`app_id` bigint(20) NOT NULL,
`metric_id` bigint(20) NOT NULL,
`offer_id` bigint(20) NOT NULL,
`epoch` bigint(20) NOT NULL,
`value` decimal(20,5) NOT NULL,
KEY `epoch` (`epoch`,`offer_id`,`metric_id`,`app_id`) /*!90619 USING CLUSTERED COLUMNSTORE */ /*!90621 WITH(COLUMNSTORE_SEGMENT_ROWS=1024000) */,
/*!90618 SHARD */ KEY `epoch_shard_key` (`epoch`,`offer_id`,`metric_id`)
) /*!90621 AUTOSTATS_ENABLED=TRUE */
This is my current query:
select
sum(value),
epoch,
metric_id
from
facts_offer_metrics f
inner join dim_metrics_ref on dim_metrics_ref.id = f.metric_id
inner join dim_offers_ref on dim_offers_ref.id = f.offer_id
where
f.epoch between 1543604400 and 1543690800
and dim_metrics_ref.metric in ('a', 'b', 'c')
and dim_offers_ref.offer_id in ('d', 'e', 'f')
group by
epoch,
metric_id;
The explain for this query is:
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Project [`sum(value)`, remote_0.epoch, remote_0.metric_id] |
| HashGroupBy [SUM(remote_0.`sum(value)`) AS `sum(value)`] groups:[remote_0.epoch, remote_0.metric_id] |
| Gather partitions:all alias:remote_0 |
| Project [`sum(value)`, f.epoch, f.metric_id] |
| HashGroupBy [SUM(f.value) AS `sum(value)`] groups:[f.epoch, f.metric_id] |
| Filter [dim_offers_ref.offer_id IN (...)] |
| NestedLoopJoin |
| |---IndexSeek report_service.dim_offers_ref, PRIMARY KEY (id) scan:[id = f.offer_id] est_table_rows:182,510 est_filtered:2 |
| Filter [dim_metrics_ref.metric IN (...)] |
| NestedLoopJoin |
| |---IndexSeek report_service.dim_metrics_ref, PRIMARY KEY (id) scan:[id = f.metric_id] est_table_rows:138 est_filtered:35 |
| Filter [f.epoch >= 1543604400 AND f.epoch <= 1543690800] |
| ColumnStoreScan report_service.facts_offer_metrics AS f, KEY epoch (epoch, offer_id, metric_id, device_country_id, creative_id, app_id, store_id) USING CLUSTERED COLUMNSTORE WITH(COLUMNSTORE_SEGMENT_ROWS=1024000) est_table_rows:5,674,880,274 est_filtered:1 |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
This query takes about 10s to execute. However, if I rewrite it using explicit foreign key filters like so:
select
sum(value),
epoch,
metric_id
from
facts_offer_metrics f
inner join dim_metrics_ref on dim_metrics_ref.id = f.metric_id
inner join dim_offers_ref on dim_offers_ref.id = f.offer_id
where
f.metric_id in (1, 2, 3)
and f.epoch between 1543604400 and 1543690800
and f.offer_id in (4, 5, 6)
group by
epoch,
metric_id;
…the performance is 100x faster (0.1s). The explain for that query is:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Project [`sum(value)`, remote_0.epoch, remote_0.metric_id] |
| HashGroupBy [SUM(remote_0.`sum(value)`) AS `sum(value)`] groups:[remote_0.epoch, remote_0.metric_id] |
| Gather partitions:all alias:remote_0 |
| Project [`sum(value)`, f.epoch, f.metric_id] |
| HashGroupBy [SUM(f.value) AS `sum(value)`] groups:[f.epoch, f.metric_id] |
| Filter [dim_metrics_ref.id IN (...)] |
| NestedLoopJoin |
| |---IndexSeek report_service.dim_metrics_ref, PRIMARY KEY (id) scan:[id = f.metric_id] est_table_rows:138 est_filtered:35 |
| HashJoin [f.offer_id = dim_offers_ref.id] |
| |---Project [f_0.value, f_0.epoch, f_0.metric_id, f_0.offer_id] est_rows:91,343 alias:f |
| | Filter [f_0.offer_id IN (...) AND f_0.epoch >= 1543604400 AND f_0.epoch <= 1543690800 AND f_0.metric_id IN (...)] |
| | ColumnStoreScan report_service.facts_offer_metrics AS f_0, KEY epoch (epoch, offer_id, metric_id, device_country_id, creative_id, app_id, store_id) USING CLUSTERED COLUMNSTORE WITH(COLUMNSTORE_SEGMENT_ROWS=1024000) est_table_rows:5,674,880,274 est_filtered:91,344 |
| BloomFilter table:f fields:dim_offers_ref.id |
| IndexSeek report_service.dim_offers_ref, PRIMARY KEY (id) scan:[id IN (...)] est_table_rows:182,510 est_filtered:2 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I’ve also tried rewriting the query using inline subselects, but I still see poor performance, e.g.:
select
sum(value) as value,
epoch,
metric_id
from
facts_offer_metrics f
where
f.metric_id in (select id from dim_metrics_ref where metric in ('a', 'b', 'c'))
and f.epoch between 1543604400 and 1543690800
and f.offer_id in (select id from dim_offers_ref where offer_id in ('d', 'e', 'f'))
group by
epoch,
metric_id;
Ideally, I’d like our client application code not to have to look up the foreign keys before building the query, and instead rely on some form of subselect to achieve equivalent performance.
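For example, something shaped like this (just a sketch, with the dimension lookups expressed as derived-table joins instead of pre-resolved IDs) would be ideal if it could plan as well as the hard-coded-ID version:
select
sum(value),
epoch,
metric_id
from
facts_offer_metrics f
-- dimension lookups as derived tables rather than pre-resolved IDs
inner join (select id from dim_metrics_ref where metric in ('a', 'b', 'c')) m on m.id = f.metric_id
inner join (select id from dim_offers_ref where offer_id in ('d', 'e', 'f')) o on o.id = f.offer_id
where
f.epoch between 1543604400 and 1543690800
group by
epoch,
metric_id;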
I did find that adding with(table_convert_subselect=true) to the joins in the first query improved performance to about 2s, but that’s still substantially slower than the second query.
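For reference, I added the hint to the dimension table references in the joins, roughly like this (the rest of the query is unchanged from the first version):
-- table_convert_subselect hint applied to each dimension join
inner join dim_metrics_ref with (table_convert_subselect=true) on dim_metrics_ref.id = f.metric_id
inner join dim_offers_ref with (table_convert_subselect=true) on dim_offers_ref.id = f.offer_id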
Any thoughts on optimizer hints or other ways to structure the query?
Thanks!