Can using incremental (but unique) IDs as a partition key create hot partitions in DynamoDB?

According to the docs, I understand that DynamoDB will take the value of the provided partition key and put it through a hashing function to decide which physical location the data should go to.

Does this mean writing items with a partition key that is sequential but still unique will produce a hot partition key?

For example, will inserting items with partition key values 10001, 10002, 10003, 10004 allow for even distribution of data across the partitions?

Or will randomly generating a partition key value, like a UUID, make it more evenly distributed?

Answer:

DynamoDB supports two different kinds of primary keys:

Partition Key
Partition Key + Sort Key

Partition Key

If your primary key consists of only a partition key, sequential key values will not create a hot partition on their own: in a table that has only a partition key, no two items can have the same partition key value.

Because your keys are always unique, DynamoDB's internal hash function spreads them across the hash space, so your data ends up distributed evenly across the logical and physical partitions.

For example, this is the MD5 hash for 10001: d89f3a35931c386956c1a402a8e09941

This is the MD5 hash for 10002: 9103c8c82514f39d8360c7430c4ee557

Even though 10002 is just 10001 incremented by 1, its entire hash is different and is in no way similar to the MD5 hash for 10001.

From a consistent hashing point of view, there is no difference between UUID values or incremental values.
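To make this concrete, here is a rough sketch that buckets sequential IDs and random UUIDs by hash. DynamoDB's real hash function and partition count are internal, so MD5, `NUM_PARTITIONS` and the `partition_for` helper are assumptions used purely for illustration:

```python
import hashlib
import uuid
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical partition count, for illustration only


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a partition key value to a bucket, using MD5 as a stand-in hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def distribution(keys) -> Counter:
    """Count how many keys land in each bucket."""
    return Counter(partition_for(k) for k in keys)


# 10,000 sequential IDs versus 10,000 random UUIDs.
sequential_keys = [str(n) for n in range(10001, 20001)]
uuid_keys = [str(uuid.uuid4()) for _ in range(10000)]

sequential_counts = distribution(sequential_keys)
uuid_counts = distribution(uuid_keys)

for bucket in range(NUM_PARTITIONS):
    print(f"partition {bucket}: "
          f"sequential={sequential_counts[bucket]} uuid={uuid_counts[bucket]}")
```

In this toy model both columns should come out roughly equal per bucket, which is the point: once the key has been hashed, it no longer matters whether it was sequential or random.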

Partition Key + Sort Key

If your primary key also contains a sort key, you can run into hot partition problems if you're not careful, because multiple items can now share the same partition key value.

All items with the same partition key value are physically stored together, in sorted order by sort key value.

If your partition key values are not as distinct as possible, you can create hot partitions.

Let me give you an example:

An eCommerce website decides to design their orders table like so, with the current date being the partition key and the sort key being the item ID:

+---------------+----------+
| Partition Key | Sort Key |
+---------------+----------+
| 19/10/2021    | item3000 |
| 19/10/2021    | item3001 |
| 20/10/2021    | item4000 |
+---------------+----------+

This may work perfectly fine at this scale: in the above example, they process 1000 items a day.

Black Friday - 26/11/2021 - arrives & they now have more than 20000 orders on one day:

+---------------+-----------+
| Partition Key | Sort Key  |
+---------------+-----------+
| 26/11/2021    | item6000  |
| 26/11/2021    | item15000 |
| 26/11/2021    | item27000 |
| 27/11/2021    | item27100 |
+---------------+-----------+

This will create a massive hot partition problem, as all of the 20000 orders on 26/11/2021 are now written to a single partition key value (as I mentioned, items with the same partition key are stored together).

The 26/11/2021 partition key will be heavily requested and hot, slowing down the database exactly when you are trying to process the most orders, and ultimately you will lose out on revenue due to slow application performance.
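A small simulation of the Black Friday scenario shows the effect. This is only a sketch: MD5 and `NUM_PARTITIONS` stand in for DynamoDB's internal hashing and partition layout, and the item-ID-keyed schema is a hypothetical alternative for comparison:

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical partition count, for illustration only


def partition_for(partition_key: str) -> int:
    """Bucket a partition key value, using MD5 as a stand-in hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS


# 20,000 Black Friday orders as (partition key, sort key) pairs.
orders = [("26/11/2021", f"item{n}") for n in range(6000, 26000)]

# Only the partition key is hashed; the sort key does not affect placement.
by_date = Counter(partition_for(pk) for pk, _sk in orders)

# Hypothetical alternative schema: the item ID is the partition key instead.
by_item = Counter(partition_for(sk) for _pk, sk in orders)

print("keyed by date:   ", dict(by_date))
print("keyed by item ID:", dict(sorted(by_item.items())))
```

Keyed by date, every order lands in the same bucket; keyed by item ID, the same 20000 writes are spread over all of them.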

The table should be designed so that the number of distinct partition key values is high relative to the total number of items. If dates must be used as the partition key, write sharding (random or calculated) would prevent this issue.
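As a minimal sketch of both write sharding flavours: the `#` separator, `SHARD_COUNT` and the function names below are my own choices for illustration, not anything DynamoDB prescribes:

```python
import hashlib
import random

SHARD_COUNT = 10  # hypothetical number of shards per date


def random_sharded_key(date: str, shard_count: int = SHARD_COUNT) -> str:
    """Random write sharding: append a random suffix to the date."""
    return f"{date}#{random.randrange(shard_count)}"


def calculated_sharded_key(date: str, item_id: str,
                           shard_count: int = SHARD_COUNT) -> str:
    """Calculated write sharding: derive the suffix from the item ID,
    so the same item always maps to the same shard."""
    digest = hashlib.md5(item_id.encode("utf-8")).hexdigest()
    return f"{date}#{int(digest, 16) % shard_count}"


def all_shard_keys(date: str, shard_count: int = SHARD_COUNT) -> list:
    """Every partition key that has to be queried to read a full day back."""
    return [f"{date}#{n}" for n in range(shard_count)]


print(random_sharded_key("26/11/2021"))
print(calculated_sharded_key("26/11/2021", "item15000"))
print(all_shard_keys("26/11/2021"))
```

The trade-off is on the read side: to fetch all orders for a day you now have to query every key from `all_shard_keys` and merge the results. With the calculated variant you can at least recompute the exact partition key for a single known item.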

If you do not have a sort key as part of your primary key, you do not need to worry about sequential key values creating hot partitions.

If you do have a sort key as part of your primary key, design your table schema so that your partition key values are as distinct as possible, and the partition key + sort key combination is unique, to avoid hot partitions.