I have a function below for generating the rows of a huge text file.
def generate_content(n):
    for _ in range(n):
        yield 'xxx'
Instead of saving the file to disk, then uploading it to S3, is there any way to save the data directly to S3?
One thing to mention is that the data could be so huge that I don't have enough disk space or memory to hold it all.
CodePudding user response:
Yes, you can use the boto3 library to upload the data generated by your function directly to S3 without saving it to disk first. Here is an example of how you might use boto3 to do this:
import io

import boto3

def generate_content(n):
    for _ in range(n):
        yield 'xxx'

s3 = boto3.client('s3')

def upload_to_s3(bucket_name, key):
    # upload_fileobj expects a file-like object, so buffer the generated
    # rows into an in-memory BytesIO (fine while the data fits in memory).
    content = io.BytesIO(''.join(generate_content(100)).encode('utf-8'))  # Generate 100 rows
    s3.upload_fileobj(content, bucket_name, key)

upload_to_s3('my-bucket', 'my-key.txt')
The upload_fileobj method uploads a file-like object, i.e. something that exposes a read() method. A generator does not provide read() on its own, which is why the example above buffers its output into an io.BytesIO first. That only works while the data fits in memory; for truly huge data you need a small wrapper that streams the generator through read(), as the second answer below shows.
For small payloads you can also call put_object with the whole body in memory. For large objects, upload_fileobj automatically splits the data into parts and uploads them in parallel (a multipart upload); if you want to manage that process yourself, the lower-level create_multipart_upload, upload_part, and complete_multipart_upload methods are available, as sketched below.
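To illustrate that lower-level route, here is a rough sketch of a manual multipart upload driven by the generator. The helper name multipart_upload_from_generator, the bucket/key placeholders, and the 8 MiB part size are all arbitrary choices, and the sketch assumes the generator yields at least one row; note that S3 requires every part except the last to be at least 5 MiB.

import boto3

def generate_content(n):
    for _ in range(n):
        yield 'xxx'

def multipart_upload_from_generator(s3, bucket_name, key, rows, part_size=8 * 1024 * 1024):
    # Start the multipart upload and remember the ETag of every uploaded part.
    upload = s3.create_multipart_upload(Bucket=bucket_name, Key=key)
    parts, buffer, part_number = [], bytearray(), 1
    try:
        for row in rows:
            buffer.extend(row.encode('utf-8'))
            if len(buffer) >= part_size:
                resp = s3.upload_part(Bucket=bucket_name, Key=key, UploadId=upload['UploadId'],
                                      PartNumber=part_number, Body=bytes(buffer))
                parts.append({'ETag': resp['ETag'], 'PartNumber': part_number})
                part_number += 1
                buffer.clear()
        if buffer:
            # Flush the final (possibly short) part.
            resp = s3.upload_part(Bucket=bucket_name, Key=key, UploadId=upload['UploadId'],
                                  PartNumber=part_number, Body=bytes(buffer))
            parts.append({'ETag': resp['ETag'], 'PartNumber': part_number})
        s3.complete_multipart_upload(Bucket=bucket_name, Key=key, UploadId=upload['UploadId'],
                                     MultipartUpload={'Parts': parts})
    except Exception:
        # Abort so S3 does not keep storing (and billing for) the orphaned parts.
        s3.abort_multipart_upload(Bucket=bucket_name, Key=key, UploadId=upload['UploadId'])
        raise

s3 = boto3.client('s3')
multipart_upload_from_generator(s3, 'my-bucket', 'my-key.txt', generate_content(100))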
CodePudding user response:
boto3 needs a file, a bytes array, or a file-like object to upload an object to S3. Of those, the only one you can reasonably use that doesn't require the entire contents of the object in memory or on disk is the file-like object, using a custom helper to satisfy the read requests.
Basically, you can call into your generator to satisfy the requests to read(), and boto3 will take care of creating the object for you:
import boto3

def generate_content(n):
    for _ in range(n):
        yield 'xxx'

# Convert a generator that returns a series of strings into
# an object that implements 'read()' in a manner similar to how
# a file object operates.
class GenToBytes:
    def __init__(self, generator):
        self._generator = generator
        self._buffers = []
        self._bytes_avail = 0
        self._at_end = False

    # Emulate a file object's read
    def read(self, to_read=1048576):
        # Call the generator to read enough data to satisfy the read request
        while not self._at_end and self._bytes_avail < to_read:
            try:
                row = next(self._generator).encode("utf-8")
                self._bytes_avail += len(row)
                self._buffers.append(row)
            except StopIteration:
                # We're all done reading
                self._at_end = True
        if len(self._buffers) > 1:
            # We have more than one pending buffer, concat them together
            self._buffers = [b''.join(self._buffers)]
        # Pull out the requested data, and store the rest
        ret, self._buffers = self._buffers[0][:to_read], [self._buffers[0][to_read:]]
        self._bytes_avail -= len(ret)
        return ret

s3 = boto3.client('s3')
generator = generate_content(100)  # Generate 100 rows
s3.upload_fileobj(GenToBytes(generator), 'my-bucket', 'my-key.txt')
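One more note on the upload_fileobj path: it already performs a multipart upload under the hood, reading the wrapper in chunks, so the whole object never has to sit in memory. If you want to tune the part size or the number of parts uploaded in parallel, you can pass a boto3.s3.transfer.TransferConfig. A minimal sketch, reusing the GenToBytes wrapper and generate_content from above (the 64 MiB chunk size and concurrency of 4 are arbitrary values):

import boto3
from boto3.s3.transfer import TransferConfig

# Upload in 64 MiB parts, with up to four parts in flight at a time.
config = TransferConfig(multipart_chunksize=64 * 1024 * 1024, max_concurrency=4)

s3 = boto3.client('s3')
generator = generate_content(100)
s3.upload_fileobj(GenToBytes(generator), 'my-bucket', 'my-key.txt', Config=config)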