By way of background, I have a tuple schema à la Datomic, with one table per value type:
e = entity ID (UUID stored as binary(16))
a = attribute ID (UUID stored as binary(16))
v = value (type depends on the table)
t = transaction ID (UUID stored as binary(16))
p = partition ID (mediumint).
So I had a query that included this:
select e, a, v
from datoms_string
where ( p = ?
and e in ( convert(?, binary)
, convert(?, binary)
, ...)
and a = convert(?, binary))
This portion of the query ended up using the avt index, which required a scan of ~500K rows. At a higher level, I was looking to get the titles of a particular set of products. MemSQL ended up scanning the titles of every product (matching on a) rather than looking only at the particular set of 50 products I was interested in (matching on e). The overall query took ~90 seconds after compilation. I thought for a bit and tried this approach:
(select e, a, v
from datoms_string
where ( p = ?
and e = convert(?, binary)
and a = convert(?, binary)))
union all
(select e, a, v
from datoms_string
where ( p = ?
and e = convert(?, binary)
and a = convert(?, binary)))
union all
...
This portion of the query ended up using the eavt index, which required a scan of only ~500 rows (IIRC, all the rows associated with the products in question).
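For anyone who wants to try the same rewrite without hand-writing one branch per entity, here is a minimal sketch in Python that generates the UNION ALL text and its parameter list. The function name build_union_all is hypothetical, and the ? placeholder style is simply copied from the queries above; your client library may expect a different placeholder (for example %s).

```python
def build_union_all(partition_id, entity_ids, attribute_id, table="datoms_string"):
    """Return (sql, params): one single-entity SELECT per entity, joined by UNION ALL.

    Hypothetical helper for illustration. The table name is interpolated
    directly, so it must come from trusted code, never from user input.
    """
    if not entity_ids:
        raise ValueError("entity_ids must not be empty")

    # One branch has exactly three placeholders: p, e, a.
    branch = (
        "(select e, a, v\n"
        f" from {table}\n"
        " where ( p = ?\n"
        " and e = convert(?, binary)\n"
        " and a = convert(?, binary)))"
    )
    sql = "\nunion all\n".join([branch] * len(entity_ids))

    # Parameters repeat in the same order as the placeholders in each branch.
    params = []
    for entity_id in entity_ids:
        params.extend([partition_id, entity_id, attribute_id])
    return sql, params


if __name__ == "__main__":
    sql, params = build_union_all(7, ["entity-1", "entity-2"], "attr-title")
    print(sql)
    print(params)
```

The resulting sql and params can be passed to whatever execute call your driver provides. Whether this form is actually faster than the IN list depends on which index the engine picks at runtime.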
This brings me to my question: is there a way for MemSQL to automatically optimize for this kind of query? I’m using MemSQL 7.0 and have enabled analysis for all the tables in the DB.
Thanks so much!
(Also I got some MemSQL swag in the mail and I really appreciate the gesture! I’ve worn the socks already.)
Can you send us a profile of these queries? I see a runtime decision between three possible indexes in the EXPLAIN in either case (eavt, primary key, avt), and those decisions are usually pretty accurate.
In case you’ve never looked at profiles in the past, run:
PROFILE <your query>;
SHOW PROFILE;
Perhaps the issue was transient. I'm not sure whether the data statistics updated automatically and led to better runtime decisions, or whether something else changed, but what I’m seeing now in MemSQL Studio is that the two queries take the same time to execute and both use the eavt index, rather than the first using avt and the second using eavt. I wish I had taken a snapshot of the profiling I did yesterday in MemSQL Studio (this is where I found the runtime index choice anomaly).
Happy to give you a profile if you like, but it doesn’t seem especially useful now that the query performance is the same. Apologies for not being more helpful.
The runtime decision I was referring to is based on a quick probing of the indexes to get an idea of which one is more selective (each index is a skiplist, so we can probe the higher towers to get an idea of how many rows the scan will return). It's not using histograms.
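To make the idea of probing the higher towers concrete, here is a toy model in Python. It is only an illustration of the general technique, not MemSQL's implementation: the class ToySkiplistIndex, the function pick_most_selective, and the fixed 1-in-2 spacing between levels are all assumptions made for the example. Tower heights are assigned deterministically so the result is reproducible, whereas a real skiplist picks them at random.

```python
from bisect import bisect_left, bisect_right


class ToySkiplistIndex:
    """Sorted keys plus sparse levels; level k keeps 1 key in every 2**k."""

    def __init__(self, keys, max_level=8):
        self.keys = sorted(keys)
        # Assumption: every 2**k-th key reaches level k (real towers are random).
        self.levels = [
            [key for i, key in enumerate(self.keys) if (i + 1) % (2 ** k) == 0]
            for k in range(max_level + 1)
        ]

    def estimate_range(self, lo, hi, level):
        """Estimate how many keys lie in [lo, hi] by inspecting one sparse level."""
        sparse = self.levels[level]
        hits = bisect_right(sparse, hi) - bisect_left(sparse, lo)
        # Each node seen at this level stands in for about 2**level keys.
        return hits * (2 ** level)

    def exact_range(self, lo, hi):
        """Exact count, which would require walking the bottom level."""
        return bisect_right(self.keys, hi) - bisect_left(self.keys, lo)


def pick_most_selective(candidates, level=4):
    """candidates maps an index name to (index, lo, hi); return the cheapest name."""
    def estimated_rows(name):
        index, lo, hi = candidates[name]
        return index.estimate_range(lo, hi, level)

    return min(candidates, key=estimated_rows)


if __name__ == "__main__":
    avt = ToySkiplistIndex(range(1024))
    eavt = ToySkiplistIndex(range(1024))
    # Pretend the predicate maps to a wide range on avt and a narrow one on eavt.
    print(avt.estimate_range(0, 511, 4), eavt.estimate_range(0, 31, 4))
    print(pick_most_selective({"avt": (avt, 0, 511), "eavt": (eavt, 0, 31)}))
```

The estimate is coarse: a range narrower than the spacing at the probed level can come back as zero even though rows exist, which is one way a probe-based choice could occasionally differ from what a full scan would show.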