Hey there!
I have a columnstore table where I’m trying to delete a set of rows whose values, across all columns, match the rows in my batch.
In my scenario, the columnstore table is 25B rows and the batch I want to delete consists of 1,000 rows. I thought the fastest way to do this would be to load the batch into a temporary table (with the same shard key as the columnstore table) and do a delete via a join with that table.
For example:
delete f
from facts f
inner join facts_to_delete d on
f.partner_id = d.partner_id and
f.app_id = d.app_id and
...
f.epoch = d.epoch;
Running this query takes approximately 7s to complete. I tried introducing some filters (e.g. on epoch), knowing what the values in that column would be, and that had some impact:
delete f
from facts f
inner join facts_to_delete d on
f.partner_id = d.partner_id and
f.app_id = d.app_id and
...
f.epoch = d.epoch
where
f.epoch in (1548493200);
This query takes 5s to complete instead of 7s. What’s curious, though, is that a select statement with the same join only takes 500ms:
select *
from facts f
inner join facts_to_delete d on
f.partner_id = d.partner_id and
f.app_id = d.app_id and
...
f.epoch = d.epoch
where
f.epoch in (1548493200);
In this case, there were no records in the facts table actually matching the batch.
Our tables are structured like so:
CREATE TABLE `facts` (
`partner_id` bigint(20) NOT NULL,
`app_id` bigint(20) NOT NULL,
...
`epoch` bigint(20) NOT NULL,
`value` decimal(20,5) NOT NULL,
KEY `app_id` (`partner_id`,`app_id`,`epoch`,...) /*!90619 USING CLUSTERED COLUMNSTORE */ /*!90621 WITH(COLUMNSTORE_SEGMENT_ROWS=102400, COLUMNSTORE_FLUSH_BYTES=2097152) */,
/*!90618 SHARD */ KEY `foreign_key_shard_key` (`partner_id`,`app_id`,`epoch`)
) /*!90621 AUTOSTATS_ENABLED=TRUE */
The temporary table is structured identically in terms of the shard key.
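For concreteness, a minimal sketch of that temporary table, assuming only the shard-key columns shown above (the real table also carries the elided columns from facts):

```sql
CREATE TEMPORARY TABLE facts_to_delete (
  `partner_id` bigint(20) NOT NULL,
  `app_id` bigint(20) NOT NULL,
  `epoch` bigint(20) NOT NULL,
  -- same shard key as facts, so the join is colocated per partition
  SHARD KEY (`partner_id`, `app_id`, `epoch`)
);
```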
As an alternative to the temporary table, I also tried running an individual delete statement for each row. For 1,000 rows this took approximately the same amount of time as the join. It appears that individual statements are faster for small batches and slower when deleting many rows.
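The per-row variant was just the same predicate spelled out with literals, one statement per batch row (the values here are hypothetical, and the elided columns are filled in the same way):

```sql
DELETE FROM facts
WHERE partner_id = 42
  AND app_id = 7
  -- ... predicates on the remaining columns ...
  AND epoch = 1548493200;
```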
So I’d love to get thoughts on how I could optimize the delete. I’m also curious why the select performance is an order of magnitude better than the delete performance, even though both presumably examine the same rows.
Thanks for any advice!