Hey
For context:
- We track users; each user has an id: user_id.
- For each user we track when they use our app; each instance of app usage is referred to as a session. A user can have many sessions. Each session has an id: session_id.
Currently, our primary use cases focus on analyzing sessions, so we have sharded the majority of our tables on session_id. Performance is great and all is good.
Now we’ve been tasked with also analyzing users. This means our sharding on session_id is no longer optimal, as we can’t be sure that all sessions related to a user are on the same partition. We could change the sharding so that we shard on user_id, but then I doubt the query planner would know that all session_ids within a partition are unique to that partition, which would make all of our current queries less optimal. Using (user_id, session_id) as the shard key probably wouldn’t be of much help either, as the user_id would still be spread across different partitions, right?
I’ll try to give a very simplified example of the issue…

Using session_id as the shard key, I assume data would be distributed sort of like this:
+---------+------------+--------+-----------+
| user_id | session_id | log_id | partition |
+---------+------------+--------+-----------+
|       1 |          1 |     11 |         1 |
|       1 |          1 |     12 |         1 |
|       1 |          1 |     13 |         1 |
|       1 |          2 |     32 |         2 |
|       2 |          3 |     65 |         3 |
|       3 |          4 |     68 |         4 |
+---------+------------+--------+-----------+
With this I can easily do count(distinct session_id) locally, but doing count(distinct user_id) needs to have data distributed between nodes.
Using user_id as the shard key, I assume data would be distributed sort of like this:
+---------+------------+--------+-----------+
| user_id | session_id | log_id | partition |
+---------+------------+--------+-----------+
|       1 |          1 |     11 |         1 |
|       1 |          1 |     12 |         1 |
|       1 |          1 |     13 |         1 |
|       1 |          2 |     32 |         1 |
|       2 |          3 |     65 |         2 |
|       3 |          4 |     68 |         3 |
+---------+------------+--------+-----------+
With this I can easily do count(distinct user_id) locally, but doing count(distinct session_id) needs to have data distributed between nodes.
Using (user_id, session_id) as the shard key, I assume data would be distributed sort of like this:
+---------+------------+--------+-----------+
| user_id | session_id | log_id | partition |
+---------+------------+--------+-----------+
|       1 |          1 |     11 |         2 |
|       1 |          1 |     12 |         2 |
|       1 |          1 |     13 |         2 |
|       1 |          2 |     32 |         3 |
|       2 |          3 |     65 |         1 |
|       3 |          4 |     68 |         3 |
+---------+------------+--------+-----------+
With this, doing either count(distinct session_id) or count(distinct user_id) needs to have data distributed between nodes, because the query planner can’t know for sure that a session_id is always tied to the same user_id.
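Putting the three options side by side, here is a quick sketch that checks, for each shard-key choice, whether a given count(distinct …) could run partition-locally. It again assumes a toy deterministic hash in place of the engine’s real one:

```python
NUM_PARTITIONS = 4

# (user_id, session_id) pairs from the example rows; log_id doesn't affect sharding.
rows = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 3), (3, 4)]

def toy_hash(key):
    # Deterministic stand-in for the engine's real hash function (an assumption).
    if isinstance(key, tuple):
        h = 0
        for part in key:
            h = h * 31 + part
        return h
    return key

def is_local(rows, shard_key, count_col):
    # True if every distinct value of count_col lives on exactly one partition,
    # i.e. count(distinct count_col) could run per-partition and the results be summed.
    partitions = {}
    for row in rows:
        p = toy_hash(shard_key(row)) % NUM_PARTITIONS
        partitions.setdefault(count_col(row), set()).add(p)
    return all(len(ps) == 1 for ps in partitions.values())

user = lambda r: r[0]
session = lambda r: r[1]
both = lambda r: (r[0], r[1])

print(is_local(rows, shard_key=session, count_col=session))  # True: local
print(is_local(rows, shard_key=session, count_col=user))     # False: user 1 is spread out
print(is_local(rows, shard_key=user, count_col=user))        # True: local
print(is_local(rows, shard_key=user, count_col=session))     # True *physically*, since each
#   session belongs to exactly one user -- but the planner can't rely on that without
#   knowing the session_id -> user_id relation is many-to-one
print(is_local(rows, shard_key=both, count_col=user))        # False: user 1 is spread out
```

The fourth line is the interesting one: sharding on user_id does physically co-locate each session’s rows; the missing piece is a way to tell the planner about the dependency.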
So I guess my question is: how do I define my shard key such that both count(distinct session_id) and count(distinct user_id) can be done locally on the partition? Is it possible to state some relation between two columns, like this:
create table log (
    id bigint not null,
    user_id bigint not null,
    session_id bigint not null unique by (user_id),
    timestamp datetime not null series timestamp,
    shard key (user_id),
    sort key (timestamp),
    primary key (id, session_id, user_id)
);
Also, I just want to note that creating a projection table with user_id as the shard key instead of session_id is currently not an option, as that would double the size of the data and therefore also the storage cost (right?).