I need to copy all files from one prefix in S3 to another prefix within the same bucket. My solution is something like:
file_list = [List of files in first prefix]
for file in file_list:
    copy_source = {'Bucket': my_bucket, 'Key': file}
    # destination key keeps the file name under the new prefix
    s3_client.copy(copy_source, my_bucket, new_prefix + file.split('/')[-1])
However, I am only moving 200 tiny files (1 KB each) and this procedure takes up to 30 seconds. It must be possible to do it faster?
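(For reference, the file list comes from a paginated listing of the source prefix. A minimal sketch, where old_prefix is just a placeholder name for my source prefix:)
import boto3

s3_client = boto3.client('s3')

# collect every key under the source prefix
paginator = s3_client.get_paginator('list_objects_v2')
file_list = []
for page in paginator.paginate(Bucket=my_bucket, Prefix=old_prefix):
    for obj in page.get('Contents', []):
        file_list.append(obj['Key'])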
CodePudding user response:
I would do it in parallel. For example:
from multiprocessing import Pool

file_list = [List of files in first prefix]

def s3_copier(s3_file):
    copy_source = {'Bucket': my_bucket, 'Key': s3_file}
    # keep the file name when building the destination key
    s3_client.copy(copy_source, my_bucket, new_prefix + s3_file.split('/')[-1])

# copy 5 objects at the same time
with Pool(5) as p:
    p.map(s3_copier, file_list)
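Since the copies are I/O-bound, a thread pool is another option and avoids sending work to separate processes altogether. A minimal sketch, assuming the same my_bucket, new_prefix, file_list and s3_client names as above:
from concurrent.futures import ThreadPoolExecutor

def s3_copier(s3_file):
    copy_source = {'Bucket': my_bucket, 'Key': s3_file}
    s3_client.copy(copy_source, my_bucket, new_prefix + s3_file.split('/')[-1])

# run up to 20 copies concurrently; the threads share the one boto3 client
with ThreadPoolExecutor(max_workers=20) as executor:
    list(executor.map(s3_copier, file_list))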
CodePudding user response:
So you have a function you need to call on a bunch of things, all of which are independent of each other. You could try multiprocessing.
from multiprocessing import Process

def copy_file(file_name, my_bucket):
    copy_source = {'Bucket': my_bucket, 'Key': file_name}
    # keep the file name when building the destination key
    s3_client.copy(copy_source, my_bucket, new_prefix + file_name.split('/')[-1])

def main():
    file_list = [...]
    for file_name in file_list:
        p = Process(target=copy_file, args=[file_name, my_bucket])
        p.start()

if __name__ == '__main__':
    main()
Then they can all start at (approximately) the same time, instead of each copy having to wait for the previous one to finish.
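If the script needs to block until every copy has finished (say, before deleting the source objects), keep the Process handles and join them. A small sketch continuing from the snippet above, under the same assumptions:
processes = []
for file_name in file_list:
    p = Process(target=copy_file, args=[file_name, my_bucket])
    p.start()
    processes.append(p)

# wait for all copies to complete before continuing
for p in processes:
    p.join()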