S3 Bucket AWS CLI takes forever to get specific files

Time:02-05

I have a log-archive bucket that contains 2.5m objects.

I want to download the files from a specific time period. I have tried several methods, but all of them fail.

My observation is that these queries start from the oldest files, but the files I need are the newest ones, so it takes forever to reach them.

aws s3 sync s3://mybucket  . --exclude "*" --include "2021.12.2*" --include "2021.12.3*" --include "2022.01.01*"  
  • Am I doing something wrong?
  • Is it possible to make these queries start from the newest files so they take less time to complete?

I also tried using S3 Browser and CloudBerry: same problem. I tried from an EC2 instance inside the same AWS network: same problem.

CodePudding user response:

2.5m objects in an Amazon S3 bucket is indeed a large number of objects!

When listing the contents of an Amazon S3 bucket, the S3 API returns at most 1000 objects per API call. Therefore, when the AWS CLI (or CloudBerry, etc.) lists the objects in the bucket, it requires 2500 API calls. Note also that the `--exclude`/`--include` filters are applied client-side, after the listing, so they do not reduce the number of API calls. This is most probably why the request takes so long (and possibly fails due to lack of memory to store the results).

You can possibly reduce the time by specifying a Prefix, which reduces the number of objects returned from the API calls. This would help if the objects you want to copy are all in a sub-folder.
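In your case the object keys appear to begin with the date, so a key prefix may already apply. As a sketch (assuming a hypothetical key layout like `2021.12.20-app.log`), the prefix can be passed server-side with `aws s3api list-objects-v2`, which only lists matching keys:

```shell
# Hypothetical key layout: objects named 2021.12.20-app.log etc.
# The --prefix is evaluated server-side, so only matching objects
# are returned instead of all 2.5m keys in the bucket.
aws s3api list-objects-v2 \
    --bucket mybucket \
    --prefix "2021.12.2" \
    --query "Contents[].Key" \
    --output text
```

One prefix per call is the trade-off: you would repeat this for `2021.12.3` and `2022.01.01`, but three short listings are far cheaper than one full scan of the bucket.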

Failing that, you could use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You could then extract from that CSV file the list of objects you want to copy (e.g. using Excel, or a program that parses the file). Then copy those specific objects using aws s3 cp or from a programming language. For example, a Python program could parse the inventory file and then use download_file() to download each of the desired objects.
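A minimal sketch of the parsing step, assuming a simplified inventory format of `bucket,key,size,last_modified` (a real S3 Inventory report has a configurable schema and no header row, so the column index may differ):

```python
import csv
import io

def keys_in_range(inventory_csv: str, prefixes: tuple) -> list:
    """Return object keys whose name starts with one of the date prefixes.

    Assumes a simplified inventory row format: bucket,key,size,last_modified.
    """
    wanted = []
    for row in csv.reader(io.StringIO(inventory_csv)):
        if not row:
            continue
        key = row[1]  # second column holds the object key in this layout
        if key.startswith(prefixes):  # str.startswith accepts a tuple
            wanted.append(key)
    return wanted

# Example with fabricated inventory rows:
sample = (
    "mybucket,2021.11.30-app.log,1024,2021-11-30T23:59:00Z\n"
    "mybucket,2021.12.25-app.log,2048,2021-12-25T23:59:00Z\n"
    "mybucket,2022.01.01-app.log,4096,2022-01-01T23:59:00Z\n"
)
print(keys_in_range(sample, ("2021.12.2", "2021.12.3", "2022.01.01")))
# → ['2021.12.25-app.log', '2022.01.01-app.log']
```

Each returned key could then be passed to boto3's `s3.download_file(bucket, key, local_path)` to fetch only the wanted objects, skipping the full bucket listing entirely.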

The simple fact is that a flat-structure Amazon S3 bucket with 2.5m objects will always be difficult to list. If possible, I would encourage you to use 'folders' to structure the bucket so that you would only need to list portions of the bucket at a time.
