Hi -
I’m running into a performance issue with a query on one of my columnstore tables. The crux of the issue seems to be how I look up the set of foreign keys that I filter the columnstore table by. This is what my table looks like:
CREATE TABLE `facts_offer_metrics` (
`app_id` bigint(20) NOT NULL,
`metric_id` bigint(20) NOT NULL,
`offer_id` bigint(20) NOT NULL,
`epoch` bigint(20) NOT NULL,
`value` decimal(20,5) NOT NULL,
KEY `epoch` (`epoch`,`offer_id`,`metric_id`,`app_id`) /*!90619 USING CLUSTERED COLUMNSTORE */ /*!90621 WITH(COLUMNSTORE_SEGMENT_ROWS=1024000) */,
/*!90618 SHARD */ KEY `epoch_shard_key` (`epoch`,`offer_id`,`metric_id`)
) /*!90621 AUTOSTATS_ENABLED=TRUE */
This is my current query:
select
sum(value),
epoch,
metric_id
from
facts_offer_metrics f
inner join dim_metrics_ref on dim_metrics_ref.id = f.metric_id
inner join dim_offers_ref on dim_offers_ref.id = f.offer_id
where
f.epoch between 1543604400 and 1543690800
and dim_metrics_ref.metric in ('a', 'b', 'c')
and dim_offers_ref.offer_id in ('d', 'e', 'f')
group by
epoch,
metric_id;
The explain for this query is:
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Project [`sum(value)`, remote_0.epoch, remote_0.metric_id] |
| HashGroupBy [SUM(remote_0.`sum(value)`) AS `sum(value)`] groups:[remote_0.epoch, remote_0.metric_id] |
| Gather partitions:all alias:remote_0 |
| Project [`sum(value)`, f.epoch, f.metric_id] |
| HashGroupBy [SUM(f.value) AS `sum(value)`] groups:[f.epoch, f.metric_id] |
| Filter [dim_offers_ref.offer_id IN (...)] |
| NestedLoopJoin |
| |---IndexSeek report_service.dim_offers_ref, PRIMARY KEY (id) scan:[id = f.offer_id] est_table_rows:182,510 est_filtered:2 |
| Filter [dim_metrics_ref.metric IN (...)] |
| NestedLoopJoin |
| |---IndexSeek report_service.dim_metrics_ref, PRIMARY KEY (id) scan:[id = f.metric_id] est_table_rows:138 est_filtered:35 |
| Filter [f.epoch >= 1543604400 AND f.epoch <= 1543690800] |
| ColumnStoreScan report_service.facts_offer_metrics AS f, KEY epoch (epoch, offer_id, metric_id, device_country_id, creative_id, app_id, store_id) USING CLUSTERED COLUMNSTORE WITH(COLUMNSTORE_SEGMENT_ROWS=1024000) est_table_rows:5,674,880,274 est_filtered:1 |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
This query takes about 10s to execute. However, if I rewrite it using explicit foreign key filters like so:
select
sum(value),
epoch,
metric_id
from
facts_offer_metrics f
inner join dim_metrics_ref on dim_metrics_ref.id = f.metric_id
inner join dim_offers_ref on dim_offers_ref.id = f.offer_id
where
f.metric_id in (1, 2, 3)
and f.epoch between 1543604400 and 1543690800
and f.offer_id in (4, 5, 6)
group by
epoch,
metric_id;
…the performance is 100x faster (0.1s). The explain for that query is:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Project [`sum(value)`, remote_0.epoch, remote_0.metric_id] |
| HashGroupBy [SUM(remote_0.`sum(value)`) AS `sum(value)`] groups:[remote_0.epoch, remote_0.metric_id] |
| Gather partitions:all alias:remote_0 |
| Project [`sum(value)`, f.epoch, f.metric_id] |
| HashGroupBy [SUM(f.value) AS `sum(value)`] groups:[f.epoch, f.metric_id] |
| Filter [dim_metrics_ref.id IN (...)] |
| NestedLoopJoin |
| |---IndexSeek report_service.dim_metrics_ref, PRIMARY KEY (id) scan:[id = f.metric_id] est_table_rows:138 est_filtered:35 |
| HashJoin [f.offer_id = dim_offers_ref.id] |
| |---Project [f_0.value, f_0.epoch, f_0.metric_id, f_0.offer_id] est_rows:91,343 alias:f |
| | Filter [f_0.offer_id IN (...) AND f_0.epoch >= 1543604400 AND f_0.epoch <= 1543690800 AND f_0.metric_id IN (...)] |
| | ColumnStoreScan report_service.facts_offer_metrics AS f_0, KEY epoch (epoch, offer_id, metric_id, device_country_id, creative_id, app_id, store_id) USING CLUSTERED COLUMNSTORE WITH(COLUMNSTORE_SEGMENT_ROWS=1024000) est_table_rows:5,674,880,274 est_filtered:91,344 |
| BloomFilter table:f fields:dim_offers_ref.id |
| IndexSeek report_service.dim_offers_ref, PRIMARY KEY (id) scan:[id IN (...)] est_table_rows:182,510 est_filtered:2 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I’ve also tried rewriting the query using inline subselects, but I still see poor performance, e.g.:
select
sum(value) as value,
epoch,
metric_id
from
facts_offer_metrics f
where
f.metric_id in (select id from dim_metrics_ref where metric in ('a', 'b', 'c'))
and f.epoch between 1543604400 and 1543690800
and f.offer_id in (select id from dim_offers_ref where offer_id in ('d', 'e', 'f'))
group by
epoch,
metric_id;
Ideally, I’d like our client application code not to have to look up the foreign keys before building the query, and instead rely on some form of subselect to achieve equivalent performance.
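For example, something shaped like this (just a sketch, with the dimension lookups expressed as derived-table joins instead of pre-resolved IDs) would be ideal if it could plan as well as the hard-coded-ID version:
select
sum(value),
epoch,
metric_id
from
facts_offer_metrics f
-- dimension lookups as derived tables rather than pre-resolved IDs
inner join (select id from dim_metrics_ref where metric in ('a', 'b', 'c')) m on m.id = f.metric_id
inner join (select id from dim_offers_ref where offer_id in ('d', 'e', 'f')) o on o.id = f.offer_id
where
f.epoch between 1543604400 and 1543690800
group by
epoch,
metric_id;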
I did find that adding with(table_convert_subselect=true) to the joins in the first query improved performance to about 2s, but that’s still substantially slower than the second query.
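For reference, I added the hint to the dimension table references in the joins, roughly like this (the rest of the query is unchanged from the first version):
-- table_convert_subselect hint applied to each dimension join
inner join dim_metrics_ref with (table_convert_subselect=true) on dim_metrics_ref.id = f.metric_id
inner join dim_offers_ref with (table_convert_subselect=true) on dim_offers_ref.id = f.offer_id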
Any thoughts on optimizer hints or other ways to structure the query?
Thanks!