I’m currently running a query against a table of 10M rows. Even with a sort key and indexes, the following query still scans half of the table to find the result:
select `created_at` from `completed_tasks` where (`user_id` = 1 or `ip_address` = '127.0.0.1') order by `created_at` desc limit 1;
This is the table schema:
CREATE TABLE `completed_tasks` (
`id` char(36) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`user_id` bigint(20) unsigned NOT NULL,
`ip_address` varchar(45) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`task_id` bigint(20) unsigned NOT NULL,
`created_at` timestamp NULL DEFAULT NULL,
KEY `completed_tasks_user_id_ip_address_index` (`user_id`,`ip_address`) USING HASH,
KEY `created_at` (`created_at`) USING CLUSTERED COLUMNSTORE,
UNIQUE KEY `PRIMARY` (`id`) USING HASH,
KEY `completed_tasks_user_id_index` (`user_id`) USING HASH,
SHARD KEY `__SHARDKEY` (`id`)
) AUTOSTATS_CARDINALITY_MODE=INCREMENTAL AUTOSTATS_HISTOGRAM_MODE=CREATE AUTOSTATS_SAMPLING=ON SQL_MODE='ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES,NO_AUTO_CREATE_USER'
So sorry to hear that you are having this issue. We’re happy to help.
Can you share whether you are running self-hosted or on the managed service, and which version?
Hi Miguilim!
The engine will always need to scan a certain number of rows, because scan granularity depends on the segment size of the columnstore table. You can reduce the segment size, but it’s generally not recommended because your data will get a worse compression rate. That said, here is some documentation on how to reduce the segment size so fewer rows are scanned: Configuring the Columnstore to Work Effectively · SingleStore Documentation
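For reference, here is a minimal sketch of setting a smaller segment size on a single table instead of cluster-wide; I’m assuming the per-table WITH option on the columnstore key as described on that docs page, and the table name and value below are purely illustrative:
-- Sketch only: per-table segment size on the columnstore key.
-- 102400 is an example value; smaller segments let more rows be skipped
-- per scan but compress worse.
CREATE TABLE completed_tasks_example (
  id char(36) NOT NULL,
  user_id bigint(20) unsigned NOT NULL,
  ip_address varchar(45) NOT NULL,
  task_id bigint(20) unsigned NOT NULL,
  created_at timestamp NULL DEFAULT NULL,
  KEY (created_at) USING CLUSTERED COLUMNSTORE WITH (COLUMNSTORE_SEGMENT_ROWS = 102400),
  SHARD KEY (id)
);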
Regarding the query itself, I would recommend creating a hash index on ip_address as well.
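Something along these lines (a sketch; the index name is just illustrative):
-- Sketch: hash index on ip_address, so the OR branch on that column can
-- use an index lookup instead of forcing a scan.
ALTER TABLE completed_tasks ADD INDEX completed_tasks_ip_address_index (ip_address) USING HASH;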
Columnstore segments are usually skipped based on the shard key, so you may want to consider changing the shard key if you really need high performance on this specific query.
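Since the shard key is fixed at table creation, changing it means rebuilding the table. A minimal sketch, assuming you reshard on user_id (the _resharded name is illustrative, and the column list should match your full schema):
-- Sketch: rebuild the table sharded on user_id, then copy the data over.
CREATE TABLE completed_tasks_resharded (
  id char(36) NOT NULL,
  user_id bigint(20) unsigned NOT NULL,
  ip_address varchar(45) NOT NULL,
  task_id bigint(20) unsigned NOT NULL,
  created_at timestamp NULL DEFAULT NULL,
  KEY (created_at) USING CLUSTERED COLUMNSTORE,
  SHARD KEY (user_id)
);
INSERT INTO completed_tasks_resharded SELECT * FROM completed_tasks;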
Also, if you click on the ColumnstoreScan operation you will see how many segments were skipped. You can also prefix your query with EXPLAIN to see how it is executed and which indexes are used.
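For example, on your query:
-- EXPLAIN shows the plan and which indexes are chosen; running the query
-- with PROFILE instead also collects per-operator statistics.
EXPLAIN SELECT created_at FROM completed_tasks
WHERE user_id = 1 OR ip_address = '127.0.0.1'
ORDER BY created_at DESC
LIMIT 1;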
How fast is the query currently, and what speed do you want to achieve for this particular query?
Thanks for the support! I’ve changed the primary key to (id, user_id), the shard key to user_id, and created an index on ip_address. I also changed the sort key created_at to DESC.
Now it’s working much better, but it’s still not ideal. This is a very frequent query that should be optimized as much as possible.
I tried lowering the columnstore_segment_rows variable (it’s now set to 102400), but it didn’t help at all; the query just scanned more segments.
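In case it helps anyone else reading, the change I made was essentially this (and, if I understood the docs correctly, it only applies to newly written segments, so existing data keeps its old segment size until it is rewritten):
-- Lower the cluster-wide segment size; affects segments written after the
-- change, e.g. after an OPTIMIZE TABLE completed_tasks FULL.
SET GLOBAL columnstore_segment_rows = 102400;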
We could help you further if you send us a CSV file (with anonymized data) and your latest schema for that table. Feel free to shoot me an email at managed@singlestore.com. We will post the resolution back on this forum so that everyone can benefit from it.
Sorry, I didn’t understand what you mean by the CSV file; do you mean a table dump? If so, it’s about 10M rows. What is the best way to send that to you by email?
About the schema, here is the latest version (the latest Visual Explain is using this schema):
CREATE TABLE `completed_tasks` (
`id` char(36) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`user_id` bigint(20) unsigned NOT NULL,
`ip_address` varchar(45) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`operation_id` bigint(20) unsigned NOT NULL,
`created_at` timestamp NULL DEFAULT NULL,
UNIQUE KEY `PRIMARY` (`id`,`user_id`) USING HASH,
KEY `created_at` (`created_at` DESC) USING CLUSTERED COLUMNSTORE,
KEY `completed_faucet_claims_ip_address_index` (`ip_address`) USING HASH,
SHARD KEY `__SHARDKEY` (`user_id`)
) AUTOSTATS_CARDINALITY_MODE=INCREMENTAL AUTOSTATS_HISTOGRAM_MODE=CREATE AUTOSTATS_SAMPLING=ON SQL_MODE='ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES,NO_AUTO_CREATE_USER';
Hello @smargareto, thanks for the suggestion! This approach doesn’t make much sense to me, though. I tried it with this sort key, setting the shard key back to id, and now the query scans the entire table.
Hello @ganesh.usi.gk, I hadn’t thought about this kind of approach. Really interesting! But I tried it, and it scans the same number of rows and behaves exactly like the default setup.
About the count query: it returns only 7. That was expected; each user only accounts for a small portion of the data!