I'm writing entries into a DynamoDB table:
import time
...
for item in my_big_map.items():
    Ddb_model(column1=item[0], column2=item[1], column_timestamp=time.time()).save()
I suspect this is slow, so I was thinking about using a multi-threading strategy such as concurrent.futures to write each entry to the table:
def write_one_entry(item):
    Ddb_model(column1=item[0], column2=item[1], column_timestamp=time.time()).save()

with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(write_one_entry, my_big_map.items())
However, I've found this way of doing batch writes in PynamoDB's documentation. It looks like a handy way to accelerate write operations.
Does it also use a multi-threading strategy?
Is the PynamoDB implementation better than using concurrent.futures to do bulk writes?
CodePudding user response:
I suspect this is slow
Correct, you're not taking advantage of the BatchWriteItem API, which allows you to write up to 16 MB of data (or a maximum of 25 put/delete requests) in a single call. It is essentially a bulk of PutItem and/or DeleteItem requests (note that you cannot update an item via BatchWriteItem, however). Not using this API means you are losing out on the performance and network improvements that come from AWS combining the write operations into one request.
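To make the request shape concrete, here is a rough sketch of a single BatchWriteItem call made through boto3's low-level client. The table name "my-table" and the attribute values are placeholders, not taken from the question; attribute names mirror the questioner's model.

import time
import boto3

client = boto3.client("dynamodb")

response = client.batch_write_item(
    RequestItems={
        "my-table": [  # placeholder table name
            # Each entry is either a PutRequest or a DeleteRequest; there is
            # no UpdateRequest, which is why updates aren't possible here.
            {"PutRequest": {"Item": {
                "column1": {"S": "key-1"},
                "column2": {"S": "value-1"},
                "column_timestamp": {"N": str(time.time())},
            }}},
            {"PutRequest": {"Item": {
                "column1": {"S": "key-2"},
                "column2": {"S": "value-2"},
                "column_timestamp": {"N": str(time.time())},
            }}},
        ]
    }
)

# Anything DynamoDB could not process comes back here and should be retried.
unprocessed = response.get("UnprocessedItems", {})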
Does it also use a multi-threading strategy?
No, and it doesn't particularly need to: an interface to the bulk API is all that is required. The main speed improvement comes from batch processing on AWS's side, not from anything done locally.
Is the PynamoDB implementation better than using concurrent.futures to do bulk writes?
Yes, because what matters for the maximum benefit is that the bulk API is actually used, not how the data is iterated.
Your concurrent.futures implementation will be faster than your original code, but it still doesn't take advantage of the BatchWriteItem API. You are speeding up how you call AWS, but you are still sending one request per item in my_big_map.items(), and that is what takes up most of the time.
From a look at the source code, PynamoDB uses the bulk API regardless of whether you use the context manager or the iterator interface, so you will be better off using the PynamoDB implementation, which also handles chunking and pagination of items for you under the hood.
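For concreteness, here is a minimal sketch of the questioner's loop rewritten with PynamoDB's batch_write() context manager. Ddb_model and my_big_map come from the question above; everything else is assumed to be defined as in the original code.

import time

with Ddb_model.batch_write() as batch:
    for key, value in my_big_map.items():
        # PynamoDB buffers these saves and flushes them as BatchWriteItem
        # calls of up to 25 items each when the context manager exits or
        # the buffer fills.
        batch.save(Ddb_model(column1=key, column2=value, column_timestamp=time.time()))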
The important part is that you use the BatchWriteItem API, which will give you the speed improvement you are looking for. PynamoDB's batch writing will let you do this (as will AWS's Boto3).
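If you would rather stay with Boto3, a comparable sketch uses the Table resource's batch_writer(), which also buffers items into BatchWriteItem calls and resends unprocessed items. Again, the table name "my-table" is a placeholder.

import time
from decimal import Decimal

import boto3

table = boto3.resource("dynamodb").Table("my-table")  # placeholder table name

with table.batch_writer() as batch:
    for key, value in my_big_map.items():
        # The resource layer rejects Python floats, so the timestamp is
        # converted to a Decimal before being written.
        batch.put_item(Item={
            "column1": key,
            "column2": value,
            "column_timestamp": Decimal(str(time.time())),
        })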