Download 5M of 1MB-sized archive files from an external FTP server to AWS S3


The problem

I have to download a large number of .tar.gz files (5 million, each approximately 1 MB in size) from an external FTP server (which I don't control) and store them in AWS S3.

My try

I have already implemented a solution based on Python's concurrent.futures.ThreadPoolExecutor and s3fs modules. I tested it on a subset of 10K files, and the full process (download with this approach, then store on AWS S3 using s3fs) took around 20 minutes. That means 10,000 / 20 = 500 archives are processed each minute. For 5 million files, it would take 5M / 500 = 10,000 minutes of processing, i.e. roughly 7 days. I can't afford to wait that long (for reasons of both time and cost, and I fear the FTP server will break the connection with my IP).
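For reference, my current implementation looks roughly like this (the FTP host, bucket and file paths below are placeholders, and I use anonymous FTP login):

    import ftplib
    import io
    from concurrent.futures import ThreadPoolExecutor

    import s3fs

    fs = s3fs.S3FileSystem()  # uses the instance's IAM role for credentials

    def transfer_one(remote_path):
        # Download one .tar.gz from the FTP server and write it straight to S3.
        buf = io.BytesIO()
        with ftplib.FTP("ftp.example.com") as ftp:   # placeholder host
            ftp.login()                              # anonymous login
            ftp.retrbinary("RETR " + remote_path, buf.write)
        with fs.open("my-bucket/archives/" + remote_path.split("/")[-1], "wb") as f:
            f.write(buf.getvalue())

    with open("file_list.txt") as fh:                # one remote path per line
        paths = [line.strip() for line in fh if line.strip()]

    with ThreadPoolExecutor(max_workers=64) as pool:
        list(pool.map(transfer_one, paths))          # list() surfaces any exception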

For that task, I used an r5.metal instance, one of the most powerful in terms of vCPUs (96) and network performance I could find in the EC2 catalogue.

My questions

So I ask:

  1. What would be the best solution for this problem?
  2. Is there a solution that takes less than one week?
  3. Are there instances that are better than r5.metal for this job?
  4. Is there a cost-effective and scalable dedicated service on AWS?
  5. In this particular case, which is the best fit among threading, multiprocessing and asyncio (or other approaches)? The same question applies to downloading 1,000 files of approximately 50 MB each.

Any help is much appreciated.

CodePudding user response:

There are two approaches you might take...

Using Amazon EC2

Pass a sub-list of files (100, say) to your Python script. Have it loop through the files, downloading each in turn to the local disk and then copying it up to Amazon S3 using boto3.
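A minimal sketch of such a worker script, assuming anonymous FTP login, with the host, bucket name and key prefix as placeholders:

    # copy_batch.py  (hypothetical name)
    import ftplib
    import os
    import sys

    import boto3

    FTP_HOST = "ftp.example.com"    # placeholder
    BUCKET = "my-archive-bucket"    # placeholder

    s3 = boto3.client("s3")

    def copy_file(ftp, remote_path):
        # Download to local disk, upload to S3, then clean up.
        local_path = os.path.basename(remote_path)
        with open(local_path, "wb") as f:
            ftp.retrbinary("RETR " + remote_path, f.write)
        s3.upload_file(local_path, BUCKET, "archives/" + local_path)
        os.remove(local_path)

    def main(list_file):
        with open(list_file) as fh:
            remote_paths = [line.strip() for line in fh if line.strip()]
        with ftplib.FTP(FTP_HOST) as ftp:   # one connection reused for the whole batch
            ftp.login()                     # anonymous login assumed
            for path in remote_paths:
                try:
                    copy_file(ftp, path)
                except Exception as exc:    # just log it; reconcile and retry later
                    print("FAILED " + path + ": " + str(exc), file=sys.stderr)

    if __name__ == "__main__":
        main(sys.argv[1])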

Do not worry about writing it with threads or doing fancy async stuff. Instead, just run lots of those Python scripts in parallel, each with its own list of files to copy (just run the script in the background using &). Once you get enough of them running in parallel, monitor the instance to determine where the bottleneck lies. You'll probably find that CPU and RAM aren't the problem; it's more likely to be the remote FTP server, which can only handle a certain volume of queries and/or bandwidth of data. A small helper to split the master list into per-script batches is sketched below.
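For the splitting, something as simple as this would do (the batch size and file naming are arbitrary):

    # split_list.py  (hypothetical name)
    import sys

    CHUNK_SIZE = 100   # files per sub-list, as suggested above

    def split(master_file):
        with open(master_file) as fh:
            paths = [line.strip() for line in fh if line.strip()]
        for i in range(0, len(paths), CHUNK_SIZE):
            with open("batch_%05d.txt" % (i // CHUNK_SIZE), "w") as out:
                out.write("\n".join(paths[i:i + CHUNK_SIZE]) + "\n")

    if __name__ == "__main__":
        split(sys.argv[1])

Launching the workers is then just a matter of starting each one in the background, e.g. python copy_batch.py batch_00000.txt &, and adding more until the FTP transfers, rather than the instance, become the limit.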

You should then be able to determine the 'sweet spot' to get the fastest throughput with the minimal cost (if that is even a consideration). You could even run multiple EC2 instances in parallel, each running the script in parallel.

Using AWS Lambda

Push messages into an Amazon SQS queue, each containing a small list of filenames.
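A sketch of that producer side, assuming each message body is a JSON list of FTP paths and that the queue (name is illustrative) already exists:

    import json

    import boto3

    sqs = boto3.client("sqs")
    queue_url = sqs.get_queue_url(QueueName="ftp-to-s3-jobs")["QueueUrl"]   # placeholder name

    def enqueue(remote_paths, group_size=50):
        # One message per group of FTP paths, sent 10 messages per API call
        # (10 is the SendMessageBatch maximum).
        groups = [remote_paths[i:i + group_size]
                  for i in range(0, len(remote_paths), group_size)]
        for i in range(0, len(groups), 10):
            entries = [{"Id": str(j), "MessageBody": json.dumps(group)}
                       for j, group in enumerate(groups[i:i + 10])]
            sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)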

Then, create an AWS Lambda function that is triggered from the SQS queue. The function should retrieve the files from the FTP server, save to local disk, then use boto3 to copy them to S3. (Make sure to delete the files after uploading to S3, since there is only limited space in a Lambda function container.)
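A sketch of what the handler could look like, using the same message format as the producer sketch above (the FTP host and bucket are placeholders):

    # lambda_function.py
    import ftplib
    import json
    import os

    import boto3

    FTP_HOST = "ftp.example.com"    # placeholder
    BUCKET = "my-archive-bucket"    # placeholder

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        with ftplib.FTP(FTP_HOST) as ftp:
            ftp.login()   # anonymous login assumed
            for record in event["Records"]:              # one record per SQS message
                for remote_path in json.loads(record["body"]):
                    name = os.path.basename(remote_path)
                    local_path = "/tmp/" + name          # /tmp is the only writable disk
                    with open(local_path, "wb") as f:
                        ftp.retrbinary("RETR " + remote_path, f.write)
                    s3.upload_file(local_path, BUCKET, "archives/" + name)
                    os.remove(local_path)                # free the limited /tmp space
        return {"processed": len(event["Records"])}

The function needs outbound internet access for the FTP connection (run it outside a VPC, or give the VPC a NAT gateway) and an execution role that allows s3:PutObject on the bucket.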

This will use the parallel capabilities of AWS Lambda to perform the operations in parallel. By default, you can run 1000 Lambda functions in parallel, but you can request an increase in this limit.

Start by testing it with a few files pushed into the SQS queue. If that works, send a few thousand messages and see how well it handles the load. You can also play with memory allocations in Lambda, but the minimum level will probably suffice.

Reconciliation

Assume that files will fail to download. Rather than retrying them, let them fail.

Then, after all the scripts have run (in either EC2 or Lambda), do a reconciliation of the files uploaded to S3 with your master list of files. Note that listing files in S3 can be a little slow (it retrieves 1000 per API call) so you might want to use Amazon S3 Inventory, which can provide a daily CSV file listing all objects.
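A sketch of that reconciliation, paginating through the bucket listing and writing the misses out as a retry list (bucket, prefix and file names are illustrative):

    import os

    import boto3

    BUCKET = "my-archive-bucket"    # placeholder
    PREFIX = "archives/"            # placeholder

    s3 = boto3.client("s3")

    def uploaded_names():
        # Paginate through the bucket listing (1000 keys per page).
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
            for obj in page.get("Contents", []):
                yield os.path.basename(obj["Key"])

    with open("master_list.txt") as fh:
        expected = {line.strip() for line in fh if line.strip()}

    done = set(uploaded_names())
    missing = [path for path in expected if os.path.basename(path) not in done]

    with open("retry_list.txt", "w") as out:
        out.write("\n".join(missing) + "\n")

    print("%d of %d files still missing" % (len(missing), len(expected)))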

General

Regardless of which approach you take, things will go wrong. For example, the remote FTP server might only allow a limited number of connections. It might have bandwidth limitations. Downloads will randomly fail. Since this is a one-off activity, it's more important to just get the files downloaded than to make the world's best process. If you don't want to wait 34 days for the download, it's imperative that you get something going quickly, so it is at least downloading while you tweak and improve the process.

Good luck! Let us know how you go!
