If your predicates always have appid = key_name, then try this:
DELETE FROM tags where appid=key_name and appid IN (XXX, YYY, ...);
I think this should compile faster and probably execute faster as well (an IN list can be optimized more easily than multiple OR conditions). Also, unlike multiple OR conditions, if you modify the IN list, the query wouldn’t need to be recompiled.
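For concreteness, here’s a sketch with made-up constants, assuming each OR branch in your current delete compares both columns to the same value:
delete from tags
where (appid = 'x1' and key_name = 'x1')
   or (appid = 'x2' and key_name = 'x2');
-- equivalent IN-list form, same made-up constants:
delete from tags
where appid = key_name and appid in ('x1', 'x2');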
Also, as @hanson hinted, query time includes both compilation and execution time. Compilation is a one-time cost per new query: if you run a query twice, the second run doesn’t incur the cost of compilation. More complex queries take longer to compile.
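As a rough illustration (the constants and the plancache check here are just one way to observe this, not your actual query):
delete from tags where appid = 'x1' and key_name = 'x1';  -- first run: compile + execute
delete from tags where appid = 'x1' and key_name = 'x1';  -- repeat run: reuses the compiled plan
show plancache;  -- lists the query plans compiled and cached on the node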
The first time I executed the query it took around 120 sec; from the next time onward it takes around 102 sec. So I hope the 102 sec is the execution time.
The appid and key_name values are different data, so I can’t use the IN clause. For the query mentioned above, execution takes almost 102 sec while compilation takes around 8 sec.
Consider also a multi-table, set-oriented DELETE statement, based on this syntax:
DELETE tbl_name FROM table_references
[WHERE expr]
[LIMIT row_count]
Like
create table tmp(appid int, key_name varchar(64));
-- insert the constants from the long OR condition into table tmp
insert into tmp values (const1_1, const1_2), (const2_1, const2_2), ...;
delete tags from tags, tmp
where tags.appid = tmp.appid and tags.key_name = tmp.key_name;
Also, MemSQL 7.0 (currently in beta) has improved performance for finding individual rows in a columnstore. That might help too.
Note that for each row deleted, MemSQL 6.8 must scan a 1M-row chunk (segment) to find the row before marking it deleted. Depending on the query, MemSQL 7.0 may be able to seek into the segment to find the row (a.k.a. “sub-segment access”) and not have to scan the whole thing. Hash indexes on columnstores in 7.0 can help as well, but they only work on single columns, so that may not be useful to you since your lookups are on two columns. You could conceivably put the concatenated value of both columns in one column and put a hash index on it, but that would require some potentially complex changes to your app.
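If you did want to try that workaround, a minimal sketch could look like the following (table and column names are made up, and your app would have to keep the combined column in sync on every write):
create table tags_combined (
  appid varchar(50),
  key_name varchar(150),
  appid_key varchar(201),  -- app stores concat(appid, ':', key_name) here
  key (appid, key_name) using clustered columnstore,
  key (appid_key) using hash  -- single-column hash index
);
delete from tags_combined where appid_key in ('app1:k1', 'app2:k2');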
Row stores with regular skip list indexes on the lookup columns can, of course, delete data much faster.
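For comparison, a rowstore version of the same idea might look like this (names are illustrative; a plain key on a rowstore table is a skip list index by default):
create table tags_rowstore (
  appid varchar(50),
  key_name varchar(150),
  key_value_string varchar(501),
  key (appid, key_name)  -- skip list index on the lookup columns
);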
Thanks hanson, the first solution gave better performance. I want to try the second solution too, but I was unable to create a HASH index on the columnstore.
I tried creating index like
create unique index test_hash ON tags2 (id) USING HASH;
but getting an error
ER_MEMSQL_FEATURE_LOCKDOWN: Feature ‘INDEX on COLUMNAR table’ is not supported by MemSQL.
We do not support unique indexes in 7.0.
In beta 1, we also do not support alter table add index or create index on an existing table (although they will be supported in our next beta and in 7.0 GA).
So for now you will have to create a new table with a secondary hash index, and dump in your data.
Try something like
create table t (a int, b int, key(a) using clustered columnstore, key(b) using hash)
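and then dump the data in with something like this (the source table name is a placeholder for your existing table):
insert into t (a, b)
select a, b from your_existing_table;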
There is also a known bug in beta 1 in the interaction between columnstore secondary indexes and alter. Do not run any alter on a table with a secondary hash index.
The query is not able to scan using the secondary hash index we created.
Our table definition:
create table tags2
(
  id INT not null,
  sub_id int,
  key_name varchar(150) not null,
  key_value_string varchar(501),
  appid varchar(50),
  KEY (appid, key_name, key_value_string, sub_id) USING CLUSTERED COLUMNSTORE,
  KEY (id) USING HASH
);
From your profile output, it seems like the hash index is working.
segments_skipped: 7
This is telling you that 7 segments were skipped. A segment can be skipped either by segment elimination or by the hash index filter. For your specific query, segment elimination probably is not going to work, so the hash index is the only explanation for those segments being eliminated.
We are currently (as of beta 1) not showing any direct information about the use of the secondary hash index in profile or explain, so it’s a bit tricky to verify. But in general, if you see a lot of segments being skipped and your filter does not match the sort key, then the secondary hash index is most likely being used.
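One way to look at this yourself is to profile a representative lookup and check the segment counters in the output (the query below is just an example):
profile select * from tags2 where id = 42;
show profile;  -- look for the segments_skipped counter shown above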
If you have 1000 elements in your IN clause, then we have to do 1000 secondary hash index seeks and compute their union. This is probably not going to be faster than doing a full table scan.