I am generating a large file in Python from an asynchronous queue that transforms many units of data and appends them (unordered) to the output.
The final destination of this file is S3. To save I/O and dead time (waiting for the file to be complete before uploading), I would like to avoid writing the file to local disk first and instead stream the data to S3 as it is generated.
The units are all of different sizes, but I can specify a reasonable maximum chunk size that is larger than any unit.
Most of the examples I see on the Web (e.g. https://medium.com/analytics-vidhya/aws-s3-multipart-upload-download-using-boto3-python-sdk-2dedb0945f11) describe how to do a multi-part upload with boto3 from a file, not from data generated at runtime.
Is this possible, and is it a recommended approach?
EDIT: I removed the "multi-part" term from the title because I realized it could be misleading. What I really need is serial streaming of data chunks.
Thanks.
CodePudding user response:
The upload() method of the MultipartUploadPart object accepts a parameter Body that can be either a file-like object or a bytes object, which is what you want.
Take a look at the documentation.
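For illustration, here is a minimal sketch using the boto3 resource API. The generate_chunks() producer, bucket name, and key are hypothetical placeholders for your own code; note that S3 requires every part except the last to be at least 5 MiB.

```python
import boto3

s3 = boto3.resource("s3")
bucket = "my-bucket"         # hypothetical bucket name
key = "large-output.dat"     # hypothetical object key

# Start the multipart upload; S3 assembles the parts by part number at the end.
mpu = s3.Object(bucket, key).initiate_multipart_upload()

parts = []
# generate_chunks() stands in for your producer; each chunk is a bytes object
# and (except possibly the last) must be at least 5 MiB.
for part_number, chunk in enumerate(generate_chunks(), start=1):
    response = mpu.Part(part_number).upload(Body=chunk)
    parts.append({"PartNumber": part_number, "ETag": response["ETag"]})

# Tell S3 to stitch the uploaded parts into the final object.
mpu.complete(MultipartUpload={"Parts": parts})
```

If anything goes wrong before complete() is called, you should call mpu.abort() so the already-uploaded parts do not keep accruing storage charges.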
CodePudding user response:
(Answering the updated question of streaming vs uploading a file)
I don't think it is possible to start an upload for a completely unknown amount of data (streaming).
All of the upload functions take either bytes or a seekable file-like object (see the documentation for S3.Object.put()), most likely because they need to know the size of the data before transmitting it.
You could, however, consider saving each result as a separate object in S3 and assembling them into one large file only when downloading. But that would require a special program to download the data, and it might also increase costs due to the higher number of requests and objects.
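A rough sketch of that approach, again assuming a hypothetical generate_chunks() producer and made-up bucket and prefix names:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"               # hypothetical bucket name
prefix = "large-output/parts/"     # hypothetical key prefix for the pieces

# Upload each generated unit as its own object as soon as it is ready.
# Zero-padded keys keep listing order equal to generation order.
for i, chunk in enumerate(generate_chunks()):
    s3.put_object(Bucket=bucket, Key=f"{prefix}{i:08d}", Body=chunk)

# Later, reassemble the file by listing the prefix and concatenating in key order.
with open("large-output.dat", "wb") as out:
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
            out.write(body.read())
```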