I'm writing entries into a DynamoDB table:
import time
...
for item in my_big_map.items():
    Ddb_model(column1=item[0], column2=item[1], column_timestamp=time.time()).save()
I suspect this is slow, so I was thinking about using a multi-threading strategy such as concurrent.futures to write each entry to the table:
def write_one_entry(item):
    Ddb_model(column1=item[0], column2=item[1], column_timestamp=time.time()).save()

with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(write_one_entry, my_big_map.items())
However, I've found this way of doing batch writes in PynamoDB's documentation. It looks like a handy way to accelerate write operations.
Does it also use a multi-threading strategy?
Is the PynamoDB implementation better than using concurrent.futures to do bulk writes?
CodePudding user response:
I suspect this is slow
Correct, you're not taking advantage of the BatchWriteItem API, which allows you to write up to 16 MB of data (or a maximum of 25 put/delete requests) in a single call. It is essentially a bulk of PutItem and/or DeleteItem requests (note that you cannot update an item via BatchWriteItem, however). Not using this API means you are losing out on the performance and network improvements that come from AWS combining the write operations into one request.
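To make the request shape concrete, here is a rough sketch of a single BatchWriteItem call made through boto3's low-level client. The table name "my-table" and the attribute values are placeholders, not taken from the question; attribute names mirror the questioner's model.

import time
import boto3

client = boto3.client("dynamodb")

response = client.batch_write_item(
    RequestItems={
        "my-table": [  # placeholder table name
            # Each entry is either a PutRequest or a DeleteRequest; there is
            # no UpdateRequest, which is why updates aren't possible here.
            {"PutRequest": {"Item": {
                "column1": {"S": "key-1"},
                "column2": {"S": "value-1"},
                "column_timestamp": {"N": str(time.time())},
            }}},
            {"PutRequest": {"Item": {
                "column1": {"S": "key-2"},
                "column2": {"S": "value-2"},
                "column_timestamp": {"N": str(time.time())},
            }}},
        ]
    }
)

# Anything DynamoDB could not process comes back here and should be retried.
unprocessed = response.get("UnprocessedItems", {})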
Does it also use a multi-threading strategy?
No, and it doesn't particularly need to: an interface to the bulk API is all that is required. The main speed improvement comes from batch processing on AWS's side, not from anything done locally.
Is the PynamoDB implementation better than using concurrent.futures to do bulk writes?
Yes, because what matters for the maximum benefit is that the bulk API is actually used, not how the data is iterated.
Your concurrent.futures implementation will be faster than your original code, but it still doesn't take advantage of the BatchWriteItem API. You are speeding up how you call AWS, but you are still sending one request per item in my_big_map.items(), and that is what takes up most of the time.
From a look at the source code, PynamoDB uses the bulk API regardless of whether you use the context manager or the iterator interface, so you will be better off using the PynamoDB implementation, which also handles chunking and pagination of items for you under the hood.
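For concreteness, here is a minimal sketch of the questioner's loop rewritten with PynamoDB's batch_write() context manager. Ddb_model and my_big_map come from the question above; everything else is assumed to be defined as in the original code.

import time

with Ddb_model.batch_write() as batch:
    for key, value in my_big_map.items():
        # PynamoDB buffers these saves and flushes them as BatchWriteItem
        # calls of up to 25 items each when the context manager exits or
        # the buffer fills.
        batch.save(Ddb_model(column1=key, column2=value, column_timestamp=time.time()))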
The important part is that you use the BatchWriteItem API, which will give you the speed improvement you are looking for. PynamoDB's batch writing will let you do this (as will AWS's Boto3).
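If you would rather stay with Boto3, a comparable sketch uses the Table resource's batch_writer(), which also buffers items into BatchWriteItem calls and resends unprocessed items. Again, the table name "my-table" is a placeholder.

import time
from decimal import Decimal

import boto3

table = boto3.resource("dynamodb").Table("my-table")  # placeholder table name

with table.batch_writer() as batch:
    for key, value in my_big_map.items():
        # The resource layer rejects Python floats, so the timestamp is
        # converted to a Decimal before being written.
        batch.put_item(Item={
            "column1": key,
            "column2": value,
            "column_timestamp": Decimal(str(time.time())),
        })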