Hi -
I appear to be running into an issue where the plan for a query against one of my columnstore tables is optimized significantly differently based on the values in a WHERE condition. In one case, a query is made against the table that matches zero rows. In another case, a query is made that matches multiple rows. Depending on which query is run first, the compiled plan, and as a result the performance, changes dramatically.
For example, if I run the zero-row query first, the compiled query time is 0.01s for the zero-row query and 2s+ for the multi-row query.
If I run the multi-row query first, the compiled query time is 0.01s for both it and the zero-row query.
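For reference, the query has roughly this shape. This is reconstructed from the EXPLAIN output below, so the exact IN-list values, the subquery nesting, and the alias names are approximations; the two variants differ only in the literals in the IN lists:

```sql
SELECT
  GROUP_CONCAT(facts_merged.value SEPARATOR ',') AS `values`,
  GROUP_CONCAT(facts_merged.epoch SEPARATOR ',') AS epochs,
  facts_merged.metric
FROM (
  -- Pre-aggregate the fact rows per (epoch, metric) before concatenating.
  SELECT SUM(f.value) AS value, f.epoch, m.metric
  FROM report_service.facts_offer_metrics_staging_rewrite f
  JOIN report_service.dim_metrics_ref m ON m.id = f.metric_id
  JOIN report_service.dim_offers_ref  o ON o.id = f.offer_id
  WHERE f.epoch IN (...)      -- literals elided
    AND m.metric IN (...)     -- literals elided
    AND o.offer_id IN (...)   -- literals elided
  GROUP BY f.epoch, m.metric
) AS facts_merged
GROUP BY facts_merged.metric;
```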
The explain for the zero-row query is:
| Project [GROUP_CONCAT(facts_merged.value SEPARATOR ','), GROUP_CONCAT(facts_merged.epoch SEPARATOR ','), facts_merged.metric] |
| HashGroupBy [GROUP_CONCAT(facts_merged.value SEPARATOR ',') AS values, GROUP_CONCAT(facts_merged.epoch SEPARATOR ',') AS epochs] groups:[facts_merged.metric] |
| TableScan 2tmp AS facts_merged storage:list stream:yes |
| Project [value, facts.epoch, facts.metric] |
| HashGroupBy [SUM(facts.value) AS value] groups:[facts.epoch, facts.metric] |
| TableScan 1tmp AS facts storage:list stream:yes |
| Project [value, remote_0.epoch, remote_0.metric] |
| HashGroupBy [SUM(remote_0.value) AS value] groups:[remote_0.epoch, remote_0.metric] |
| Gather partitions:all alias:remote_0 |
| Project [value, f.epoch, dim_metrics_ref.metric] |
| HashGroupBy [SUM(f.value) AS value] groups:[f.epoch, dim_metrics_ref.metric] |
| Filter [dim_offers_ref.offer_id IN (...)] |
| NestedLoopJoin |
| |---IndexSeek report_service.dim_offers_ref, PRIMARY KEY (id) scan:[id = f.offer_id] est_table_rows:182,510 est_filtered:2 |
| Filter [dim_metrics_ref.metric IN (...)] |
| NestedLoopJoin |
| |---IndexSeek report_service.dim_metrics_ref, PRIMARY KEY (id) scan:[id = f.metric_id] est_table_rows:138 est_filtered:35 |
| Filter [f.epoch IN (...)] |
| ColumnStoreScan report_service.facts_offer_metrics_staging_rewrite AS f, KEY epoch (epoch, offer_id, metric_id, device_country_id, creative_id, app_id, store_id) USING CLUSTERED COLUMNSTORE WITH(COLUMNSTORE_SEGMENT_ROWS=1024000) est_table_rows:31,414,353 est_filtered:1 |
The explain for the multi-row query is:
| Project [GROUP_CONCAT(facts_merged.value SEPARATOR ','), GROUP_CONCAT(facts_merged.epoch SEPARATOR ','), facts_merged.metric] |
| HashGroupBy [GROUP_CONCAT(facts_merged.value SEPARATOR ',') AS values, GROUP_CONCAT(facts_merged.epoch SEPARATOR ',') AS epochs] groups:[facts_merged.metric] |
| TableScan 2tmp AS facts_merged storage:list stream:yes |
| Project [value, facts.epoch, facts.metric] |
| HashGroupBy [SUM(facts.value) AS value] groups:[facts.epoch, facts.metric] |
| TableScan 1tmp AS facts storage:list stream:yes |
| Project [value, remote_0.epoch, remote_0.metric] |
| HashGroupBy [SUM(remote_0.value) AS value] groups:[remote_0.epoch, remote_0.metric] |
| Gather partitions:all alias:remote_0 |
| Project [value, f.epoch, dim_metrics_ref.metric] |
| HashGroupBy [SUM(f.value) AS value] groups:[f.epoch, dim_metrics_ref.metric] |
| HashJoin [f.metric_id = dim_metrics_ref.id] |
| |---Project [dim_metrics_ref_0.metric, dim_metrics_ref_0.id] est_rows:35 alias:dim_metrics_ref |
| | IndexSeek report_service.dim_metrics_ref AS dim_metrics_ref_0, UNIQUE KEY dim_metrics_metric_key (metric) scan:[metric IN (...)] est_table_rows:138 est_filtered:35 |
| HashJoin [f.offer_id = dim_offers_ref.id] |
| |---Project [dim_offers_ref_0.id] est_rows:1 alias:dim_offers_ref |
| | IndexSeek report_service.dim_offers_ref AS dim_offers_ref_0, UNIQUE KEY dim_offers_offer_id_key (offer_id) scan:[offer_id IN (...)] est_table_rows:182,510 est_filtered:2 |
| Filter [f.epoch IN (...)] |
| BloomFilter table:dim_metrics_ref fields:f.metric_id |
| BloomFilter table:dim_offers_ref fields:f.offer_id |
| ColumnStoreScan report_service.facts_offer_metrics_staging_rewrite AS f, KEY epoch (epoch, offer_id, metric_id, device_country_id, creative_id, app_id, store_id) USING CLUSTERED COLUMNSTORE WITH(COLUMNSTORE_SEGMENT_ROWS=1024000) est_table_rows:31,414,353 est_filtered:31,414,353 |
I’ve tried rewriting the query in various ways, but haven’t been able to force the query optimizer to use the correct plan when the zero-row query runs first. I’m not sure how to guarantee that the right plan gets used, since which plan is compiled and reused is controlled by the order in which the queries arrive at the database.
I’d love any thoughts you might have. Thanks!