I am using an AWS DynamoDB database that contains over 10M users. We made a mistake when processing some events and persisted some records the wrong way. I created a script that scans the table in batches of 250 and does the update job. I ran it locally and it worked, but it took 4 days to finish the whole database, and I had issues with the AWS session expiring: each time I had to log out, log in, and run the script again until it was done. I am also tracking the processed records in another table, so when I restart the script it starts from the last record and not from the beginning.
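For reference, this is roughly what the script does (a minimal sketch, assuming Python with boto3; the table names, job id, and the fix itself are hypothetical placeholders):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("Users")                    # hypothetical table name
checkpoints = dynamodb.Table("MigrationCheckpoints")  # hypothetical progress table

def fix_record(item):
    # Placeholder: apply whatever correction the bad event processing needs.
    item["status"] = "fixed"
    return item

def load_checkpoint():
    # Resume from the last key we persisted, or start from the beginning.
    item = checkpoints.get_item(Key={"job_id": "fix-events"}).get("Item")
    return item.get("last_evaluated_key") if item else None

def save_checkpoint(last_key):
    checkpoints.put_item(
        Item={"job_id": "fix-events", "last_evaluated_key": last_key}
    )

start_key = load_checkpoint()
while True:
    kwargs = {"Limit": 250}  # batch of 250
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key
    page = users.scan(**kwargs)
    for item in page["Items"]:
        users.put_item(Item=fix_record(item))
    start_key = page.get("LastEvaluatedKey")
    if not start_key:
        break  # finished the whole table
    save_checkpoint(start_key)
```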
Everyone is happy the job is done, but I am not, which is why I am here. So my QUESTION: what could I build with AWS services so that the next time this happens it runs faster and doesn't run locally? Any schema design/explanation/suggestions are welcome.
CodePudding user response:
Architecture
AWS Lambda is capable of functionally unlimited horizontal scaling. A Lambda can execute for a maximum of 15 minutes, so you just need to give it the right amount of work so it finishes within its configured timeout. In this model you already have a batch size of 250, so each invocation probably needs only seconds, not 15 minutes, to execute a batch.
Next is the question of telling your lambda what part of the work it is to do. SQS is a great solution here: you can push an SQS message for each batch of work, and have AWS automatically invoke your lambda for each message.
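A sketch of what the consumer side might look like, assuming each SQS message body carries one batch of primary keys (a one-off producer script would scan the table and enqueue one message per batch; the table name, key shape, and update expression are hypothetical):

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("Users")  # hypothetical table name

def handler(event, context):
    # SQS invokes the lambda with a batch of messages; each body is
    # assumed to look like {"keys": [{"user_id": "..."}, ...]}.
    for record in event["Records"]:
        batch = json.loads(record["body"])
        for key in batch["keys"]:
            users.update_item(
                Key=key,
                UpdateExpression="SET #s = :fixed",  # hypothetical fix
                ExpressionAttributeNames={"#s": "status"},
                ExpressionAttributeValues={":fixed": "corrected"},
            )
    # If the handler raises, SQS redelivers the messages, which is safe
    # here because the update is idempotent.
```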
You'll also have to scale up your DynamoDB table's throughput capacity to keep up with the massive volume of reads and writes these lambdas are capable of generating, without impacting production. Start by throttling the lambdas to a low volume, then increase the function's allowed concurrency while you monitor your production SLIs to make sure you're not causing any impact.
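Reserved concurrency works as that throttle. A minimal sketch (the function name and the starting value of 5 are placeholders to raise as your metrics allow):

```python
import boto3

lambda_client = boto3.client("lambda")

# Cap the function at 5 concurrent executions; raise this gradually
# while watching production SLIs.
lambda_client.put_function_concurrency(
    FunctionName="fix-user-records",  # hypothetical function name
    ReservedConcurrentExecutions=5,
)
```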
The cost of this kind of system is fairly low. 10M / 250 = 40,000 batches; if every batch is one message and one lambda invoke, you need 40,000 invocations, plus at least 40,000 SQS writes, 40,000 SQS reads, and 40,000 SQS deletes (on failure, a message is returned to the queue for reprocessing, which is perfect for your case because the operation sounds idempotent). If you're not using Lambda or SQS for anything else, this should fit comfortably within the free tier (1M SQS operations; 1M Lambda invokes; 400,000 GB-seconds of Lambda runtime) unless your batch updates take more than 10 seconds each or require more than 1 GB of memory. Based on 4 days / 40,000 batches (345,600 s / 40,000 ≈ 8.6 s), it sounds like you're running a little under 9 seconds per batch.
Language
This won't matter as much when you're scaling horizontally, but it's still worth thinking about. Some languages are much, much faster than others, and that can make a big difference when 40,000 batches of operations must be performed. The slowest approach is repeated calls to the AWS CLI: each one instantiates an entire Python runtime, which has quite a lot of overhead. Faster is running all the batches in one process of an interpreted language such as Python, because at least then you don't have to start 40,000 processes, with all the overhead that entails. You will still pay the overhead of converting all data between the Python layer and the C layer; for one operation this overhead is barely noticeable, but for 40,000 requests it can be huge. I assume more performant languages such as Java would deliver similar performance improvements. I can say from experience that the AWS SDK for Go outperforms Python by a tremendous amount, and parallelizes well to boot.
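To make the single-process point concrete, the key is paying client construction once and reusing it for every call, instead of paying CLI startup per call (a sketch in Python; the table name and batch shape are hypothetical):

```python
import boto3

# One client/table handle, created once and reused for every batch.
# Contrast with shelling out to `aws dynamodb put-item` 40,000 times,
# which starts a fresh Python runtime for every single call.
dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("Users")  # hypothetical table name

def process_batches(batches):
    for batch in batches:          # each batch: a list of fixed items
        for item in batch:
            users.put_item(Item=item)
```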
CodePudding user response:
AWS Glue might be a good option for processing your data (https://docs.aws.amazon.com/glue/latest/dg/add-job.html). You can load your data from DynamoDB (https://docs.aws.amazon.com/prescriptive-guidance/latest/dynamodb-full-table-copy-options/aws-glue.html) and use a Jupyter notebook to write some very simple Python code to process it.
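A minimal sketch of the read side in a Glue ETL job, using Glue's DynamoDB connector (the table name and read-throughput setting are placeholders):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the whole table through Glue's DynamoDB connector.
frame = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "Users",        # hypothetical table name
        "dynamodb.throughput.read.percent": "0.5",  # limit impact on production
    },
)

# From here, fix the records with ordinary (Py)Spark transformations.
df = frame.toDF()
```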
You could even get away with using AWS Glue DataBrew, a visual interface that lets you process your data without real knowledge of a programming language (https://aws.amazon.com/glue/features/databrew/).