I'm learning about sharding approaches, specifically how to achieve good horizontal scalability with a large number of shards in an IO-heavy application. Below I describe a case that I expect to see in my app. I think this would be relatively common in the wild; however, I was unable to find much info on it.
Let's say that we need to shard a table/collection where each row is associated with a client. All queries will include a single client id (uuid). Updates and reads are mostly evenly distributed among clients.
From what I've read, in this case I would want to use a hashed sharding key on the client id. Reads would touch a single shard, giving the best performance. Writes would be evenly distributed as long as clients produce roughly the same load.
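For concreteness, this is roughly the application-side routing I imagine for the hashed key; the hash choice, shard count, and names are just placeholders:

```python
import hashlib
import uuid

NUM_SHARDS = 16  # placeholder shard count

def shard_for_client(client_id: uuid.UUID) -> int:
    """Map a client id to a shard with a stable hash, so every read and
    write for that client lands on the same shard."""
    digest = hashlib.sha256(client_id.bytes).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

print(shard_for_client(uuid.uuid4()))
```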
But what should I do if there is a very small subset of clients that produces so much IO load that a single shard would have trouble handling it?
If we change the sharding key to a random record ID, then writes for all clients would be distributed across all shards. But reads would have to hit every shard, which is not efficient, especially when there are many shards.
How do we achieve a balance: have average clients be evenly distributed, and at the same time allow large clients to occupy multiple shards? Are there any DB solutions that would be able to do this automatically? Or do we have to write custom logic for tracking DB load and redistributing large clients between shards? What should I read on the topic?
CodePudding user response:
I'd suggest adding a new attribute to the client's records; for example, we could call it `part`. Assign a single value to simple clients, and store the same value in `part` for all their records.

But heavy clients would be assigned multiple values for `part`, up to the number of shards. Every record for that client would set its `part` to one of these values. Assign them either randomly or round-robin, however you think is most efficient. The point being to use each `part` value with approximately even frequency.
Your hashing algorithm for mapping clients to a shard would then use the client id and the `part` attribute. So each simple client would still store all their data on a single shard. But heavy clients will distribute their data over multiple shards.
This does mean that for the heavy clients, a read query would need to search multiple shards. Code your searches to loop over the `part` values for the client. For most clients, this loop will only need to execute once. For the heavy clients, the loop will execute once for each `part` value associated with that client.
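Here is a minimal sketch of the whole scheme in application code. Everything here is illustrative: `parts_for_client`, `NUM_SHARDS`, and `query_shard` are made-up stand-ins for your own metadata store, configuration, and data access layer.

```python
import hashlib
import random

NUM_SHARDS = 16  # illustrative shard count

# Hypothetical metadata: simple clients get one part value,
# heavy clients get several (up to the number of shards).
parts_for_client = {
    "simple-client": [0],
    "heavy-client": [0, 1, 2, 3],
}

def shard_for(client_id: str, part: int) -> int:
    """Hash the client id together with the part, so a heavy client's
    records spread over several shards while a simple client stays on one."""
    key = f"{client_id}:{part}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def route_write(client_id: str) -> tuple[int, int]:
    """Pick a part for a new record (randomly here; round-robin also works)
    and return (part, shard) so the record is stored with its part value."""
    part = random.choice(parts_for_client[client_id])
    return part, shard_for(client_id, part)

def query_shard(shard: int, client_id: str, part: int) -> list[dict]:
    """Placeholder for a single-shard query in your data access layer."""
    return []

def read_for_client(client_id: str) -> list[dict]:
    """Loop over the client's part values. Simple clients run this once;
    heavy clients run it once per part."""
    results: list[dict] = []
    for part in parts_for_client[client_id]:
        results.extend(query_shard(shard_for(client_id, part), client_id, part))
    return results

print(route_write("heavy-client"))
print(read_for_client("simple-client"))
```

Because the routing is a deterministic hash of client id plus part, the only state you have to manage is each client's list of part values; spreading a heavy client further just means growing that list.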
To be honest, I've never seen a load so great that this would be necessary. It's more likely that the traffic for one client is too much for one database instance because the queries are not optimized well, or the application is running more queries than it should. It's important to make sure you analyze query efficiency before you make your sharding architecture more complex.
CodePudding user response:
You've tagged your question with cockroachdb, so you probably already suspect this, but CockroachDB handles sharding transparently. If your primary key is composite and the first column is the client id, data with the same client id will all fall in a contiguous key range, and therefore generally be stored on the same node. If a range gets bigger than a configurable limit, and/or gets much more traffic, CockroachDB will automatically split the range to rebalance storage and traffic across nodes. You'll mostly not have to pay attention to this, and for your pattern you won't want to do any explicit sharding. However, if you do need to inspect or tweak the behavior, there are tools to do so, such as SHOW RANGES.