Detecting and Resolving Data Skew

erica.anderson · June 30, 2022, 4:55pm

Hi there, I have a question about this page in the docs: Detecting and Resolving Data Skew

There used to be a 5th section with a query to help you calculate what the data distribution for a given table would look like given a different column(s) as the shard key. I see now that this has been removed. I have an archived version saved, but I’m curious to know why this was deleted. Were those instructions not accurate? Is there another way to estimate data skew given a candidate shard key?

Thank you

MariaSilverhardt · July 5, 2022, 10:11pm

Hi Erica! Hope your holiday weekend was great!

It has been confirmed that those instructions are accurate. We have listed the sample code is below and hope you find it helpful.

A desired outcome of sharding is to have an even distribution of records across partitions, so we recommend selecting a shard key with high cardinality.

SELECT WITH(leaf_pushdown=true) SUM(c) rows, PARTITION_ID() FROM (SELECT count(*) c FROM <table_name> GROUP BY <potential_shard_key>) reshuffle GROUP BY PARTITION_ID();

We recommend running this query to check for data skew before making shard key changes. If your data is evenly distributed across partitions, and all of your keys are variables that are often queried/filtered on, then it is a good selection.

Please keep us posted on how it goes. Thanks

erica.anderson · July 5, 2022, 10:58pm

Thanks! I did have this saved, just found it odd that it was removed from the documentation. It’s definitely been helpful for me when deciding on a new shard key.

hanson · July 8, 2022, 12:58am

I asked our documentation team to consider adding this back if it makes sense. Thanks for the feedback.

galley · July 8, 2022, 1:45pm

We’ve improved the text a bit and reposted the information: SingleStoreDB Cloud · SingleStore Documentation

Thank you for posting!