I have successfully created code that compares the files in two buckets and copies only the files from the source bucket that aren't in the destination bucket. It works great on my sample folder with only 15 objects in it. When I try it on the folders with my real data, though, I get out-of-memory errors and the process is "Killed". I thought that since it was essentially doing this one object at a time, it shouldn't have this problem. The problem is probably that I'm using paginate and saving all of that data, so lots of data. I've rewritten this so many times, and this solution is so simple and works great on a small amount of data. This is why I hate dumbing down my data; reality is different. Do I have to go back to using list_objects_v2? How do I get through this? Thanks!
import boto3

s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
paginator = s3_client.get_paginator("list_objects_v2")
bucket1 = 'bucket1_name'
bucket2 = 'bucket2_name'
pages1 = paginator.paginate(Bucket=bucket1, Prefix='test')
pages2 = paginator.paginate(Bucket=bucket2, Prefix='test')

bucket1_list = []
for page1 in pages1:
    for obj1 in page1['Contents']:
        obj1_key = obj1['Key']
        if not obj1_key.endswith("/"):
            bucket1_list.append(obj1_key)

bucket2_list = []
for page2 in pages2:
    for obj2 in page2['Contents']:
        obj2_key = obj2['Key']
        if not obj2_key.endswith("/"):
            bucket2_list.append(obj2_key)

# Compare which keys from bucket1 aren't in bucket2,
# i.e. what needs to be copied to bucket2 to make them equal.
tocopy = []
for i in bucket1_list:
    if i not in bucket2_list:
        tocopy.append(i)

# COPY
copy_to_bucket = s3.Bucket(bucket2)
for i in tocopy:
    copy_source = {'Bucket': bucket1, 'Key': i}
    copy_to_bucket.copy(copy_source, i)
CodePudding user response:
You don't need to build a complete list of every item in order to compare the buckets.
Since the S3 list APIs always return keys in lexicographical order, you can enumerate both buckets in parallel and compare one key at a time. If the current keys differ, the key that sorts first is unique to its bucket: handle that case, advance that bucket's enumeration, and compare again against the current item from the other bucket.
For instance, this little script will compare two buckets and show the differences, including cases where the same key name appears in both buckets but with different sizes:
#!/usr/bin/env python3
import boto3

def enumerate_bucket(s3, bucket):
    # Wrap a paginator so only one page is loaded at a time, and
    # downstream users can call this to get all items in a bucket
    # without worrying about the pages
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket):
        for cur in page.get("Contents", []):
            yield cur
    # Yield one final None to signal the end, so the caller doesn't
    # need to look for the StopIteration exception
    yield None

def compare_buckets(s3, bucket_a, bucket_b):
    # Set up the two enumerations and grab the first object from each
    worker_a = enumerate_bucket(s3, bucket_a)
    worker_b = enumerate_bucket(s3, bucket_b)
    item_a = next(worker_a)
    item_b = next(worker_b)
    # Keep comparing items till we hit the end of one bucket
    while item_a is not None and item_b is not None:
        if item_a['Key'] == item_b['Key']:
            # Same key, so emit an entry only if the sizes differ
            if item_a['Size'] != item_b['Size']:
                yield item_a['Key'], item_b['Key'], 'Changed'
            # Move to the next item in each bucket
            item_a = next(worker_a)
            item_b = next(worker_b)
        elif item_a['Key'] < item_b['Key']:
            # Bucket A has the key that sorts first, meaning
            # it's unique to that bucket
            yield item_a['Key'], None, 'Bucket A only'
            item_a = next(worker_a)
        else:
            # Bucket B has the key that sorts first, meaning
            # it's unique to that bucket
            yield None, item_b['Key'], 'Bucket B only'
            item_b = next(worker_b)
    # All done with the main loop; any items remaining in either
    # bucket are unique. At most one of these two while loops will
    # do anything, and neither will if both buckets are exhausted.
    while item_a is not None:
        yield item_a['Key'], None, 'Bucket A only'
        item_a = next(worker_a)
    while item_b is not None:
        yield None, item_b['Key'], 'Bucket B only'
        item_b = next(worker_b)

# Now just loop through and show the differing items; logic inside
# this for loop can act on the three values to handle unique or
# changed items (for example, copying the missing ones, as sketched
# after this script)
s3 = boto3.client('s3')
for key_a, key_b, status in compare_buckets(s3, 'example-bucket-1', 'example-bucket-2'):
    print(f"{key_a} -> {key_b} ({status})")