I have a function below for generating the rows of a huge text file.
def generate_content(n):
    for _ in range(n):
        yield 'xxx'
Instead of saving the file to disk, then uploading it to S3, is there any way to save the data directly to S3?
One thing to mention is that the data could be so huge that I don't have enough disk space or memory to hold it all.
CodePudding user response:
Yes, you can use the boto3 library to upload the data generated by your function directly to S3 without saving it to disk first. Here is an example of how you might use boto3 to do this:
import io

import boto3

def generate_content(n):
    for _ in range(n):
        yield 'xxx'

s3 = boto3.client('s3')

def upload_to_s3(bucket_name, key):
    # upload_fileobj expects a file-like object, so buffer the generated
    # rows into an in-memory BytesIO (fine while the data fits in memory).
    content = io.BytesIO(''.join(generate_content(100)).encode('utf-8'))  # Generate 100 rows
    s3.upload_fileobj(content, bucket_name, key)

upload_to_s3('my-bucket', 'my-key.txt')
The upload_fileobj method uploads a file-like object, i.e. something that exposes a read() method. A generator does not provide read() on its own, which is why the example above buffers its output into an io.BytesIO first. That only works while the data fits in memory; for truly huge data you need a small wrapper that streams the generator through read(), as the second answer below shows.
For small payloads you can also call put_object with the whole body in memory. For large objects, upload_fileobj automatically splits the data into parts and uploads them in parallel (a multipart upload); if you want to manage that process yourself, the lower-level create_multipart_upload, upload_part, and complete_multipart_upload methods are available, as sketched below.
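To illustrate that lower-level route, here is a rough sketch of a manual multipart upload driven by the generator. The helper name multipart_upload_from_generator, the bucket/key placeholders, and the 8 MiB part size are all arbitrary choices, and the sketch assumes the generator yields at least one row; note that S3 requires every part except the last to be at least 5 MiB.

import boto3

def generate_content(n):
    for _ in range(n):
        yield 'xxx'

def multipart_upload_from_generator(s3, bucket_name, key, rows, part_size=8 * 1024 * 1024):
    # Start the multipart upload and remember the ETag of every uploaded part.
    upload = s3.create_multipart_upload(Bucket=bucket_name, Key=key)
    parts, buffer, part_number = [], bytearray(), 1
    try:
        for row in rows:
            buffer.extend(row.encode('utf-8'))
            if len(buffer) >= part_size:
                resp = s3.upload_part(Bucket=bucket_name, Key=key, UploadId=upload['UploadId'],
                                      PartNumber=part_number, Body=bytes(buffer))
                parts.append({'ETag': resp['ETag'], 'PartNumber': part_number})
                part_number += 1
                buffer.clear()
        if buffer:
            # Flush the final (possibly short) part.
            resp = s3.upload_part(Bucket=bucket_name, Key=key, UploadId=upload['UploadId'],
                                  PartNumber=part_number, Body=bytes(buffer))
            parts.append({'ETag': resp['ETag'], 'PartNumber': part_number})
        s3.complete_multipart_upload(Bucket=bucket_name, Key=key, UploadId=upload['UploadId'],
                                     MultipartUpload={'Parts': parts})
    except Exception:
        # Abort so S3 does not keep storing (and billing for) the orphaned parts.
        s3.abort_multipart_upload(Bucket=bucket_name, Key=key, UploadId=upload['UploadId'])
        raise

s3 = boto3.client('s3')
multipart_upload_from_generator(s3, 'my-bucket', 'my-key.txt', generate_content(100))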
CodePudding user response:
boto3 needs a file, a bytes array, or a file-like object to upload an object to S3. Of those, the only one you can reasonably use that doesn't require the entire contents of the object in memory or on disk is the file-like object, using a custom helper to satisfy the read requests.
Basically, you can call into your generator to satisfy the requests to read(), and boto3 will take care of creating the object for you:
import boto3

def generate_content(n):
    for _ in range(n):
        yield 'xxx'

# Convert a generator that returns a series of strings into
# an object that implements 'read()' in a manner similar to how
# a file object operates.
class GenToBytes:
    def __init__(self, generator):
        self._generator = generator
        self._buffers = []
        self._bytes_avail = 0
        self._at_end = False

    # Emulate a file object's read
    def read(self, to_read=1048576):
        # Call the generator to read enough data to satisfy the read request
        while not self._at_end and self._bytes_avail < to_read:
            try:
                row = next(self._generator).encode("utf-8")
                self._bytes_avail += len(row)
                self._buffers.append(row)
            except StopIteration:
                # We're all done reading
                self._at_end = True
        if len(self._buffers) > 1:
            # We have more than one pending buffer, concat them together
            self._buffers = [b''.join(self._buffers)]
        # Pull out the requested data, and store the rest
        ret, self._buffers = self._buffers[0][:to_read], [self._buffers[0][to_read:]]
        self._bytes_avail -= len(ret)
        return ret

s3 = boto3.client('s3')
generator = generate_content(100)  # Generate 100 rows
s3.upload_fileobj(GenToBytes(generator), 'my-bucket', 'my-key.txt')
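One more note on the upload_fileobj path: it already performs a multipart upload under the hood, reading the wrapper in chunks, so the whole object never has to sit in memory. If you want to tune the part size or the number of parts uploaded in parallel, you can pass a boto3.s3.transfer.TransferConfig. A minimal sketch, reusing the GenToBytes wrapper and generate_content from above (the 64 MiB chunk size and concurrency of 4 are arbitrary values):

import boto3
from boto3.s3.transfer import TransferConfig

# Upload in 64 MiB parts, with up to four parts in flight at a time.
config = TransferConfig(multipart_chunksize=64 * 1024 * 1024, max_concurrency=4)

s3 = boto3.client('s3')
generator = generate_content(100)
s3.upload_fileobj(GenToBytes(generator), 'my-bucket', 'my-key.txt', Config=config)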