There used to be a 5th section with a query to help you calculate what the data distribution for a given table would look like given a different column(s) as the shard key. I see now that this has been removed. I have an archived version saved, but I’m curious to know why this was deleted. Were those instructions not accurate? Is there another way to estimate data skew given a candidate shard key?
It has been confirmed that those instructions are accurate. We have listed the sample code is below and hope you find it helpful.
A desired outcome of sharding is to have an even distribution of records across partitions, so we recommend selecting a shard key with high cardinality.
SELECT WITH(leaf_pushdown=true) SUM(c) rows, PARTITION_ID() FROM (SELECT count(*) c FROM <table_name> GROUP BY <potential_shard_key>) reshuffle GROUP BY PARTITION_ID();
We recommend running this query to check for data skew before making shard key changes. If your data is evenly distributed across partitions, and all of your keys are variables that are often queried/filtered on, then it is a good selection.
Thanks! I did have this saved, just found it odd that it was removed from the documentation. It’s definitely been helpful for me when deciding on a new shard key.