Sending larger than 5GB dataframe to S3


I'm attempting to upload a dataframe to S3. The dataframe is built by reading several data sources, joining them, and performing a few transformations, all fully in memory. I need to store the data in S3 with each line being a JSON record, like so:

{"key_1": "value_11", "key_2": "value_12", ...}
{"key_1": "value_21", "key_2": "value_22", ...}
...

I was using put_object() and had no issues until the tables grew past 5 GB, S3's size limit for a single PUT.

Code Snippet

from io import StringIO
from boto3.s3.transfer import TransferConfig

...

json_buffer = StringIO()
df_copy.to_json(json_buffer, orient="records", lines=True)
json_buffer.seek(0)

# self.__s3.put_object(
#     Bucket=bucket, 
#     Body=json_buffer.getvalue(), 
#     Key=key_json)

GB = 1024 ** 3
# Ensure that multipart uploads only happen if the size of a transfer
# is larger than S3's size limit for nonmultipart uploads, which is 5 GB.
upl_config = TransferConfig(multipart_threshold=5*GB)

self.__s3.upload_fileobj(
    json_buffer,
    Bucket=bucket, 
    Key=key_json,
    Config=upl_config)

With the above code I get the following error:

TypeError: Unicode-objects must be encoded before hashing

I've tried the following approaches:

Question: Is there any way to save the data as JSON records rather than as a CSV, like in the other question?

CodePudding user response:

It looks like you need to pass in a buffer of bytes, not str: the transfer manager hashes the data it uploads (hence the "must be encoded before hashing" error), and hashing requires bytes. Try using BytesIO instead of StringIO.

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_fileobj
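
A minimal sketch of that change, assuming the bucket, key_json, and df_copy variables from the question (a plain boto3 client stands in for self.__s3, and the bucket/key values are placeholders): serialize the dataframe to a string, encode it to UTF-8 bytes, and hand a BytesIO to upload_fileobj.

import boto3
from io import BytesIO
from boto3.s3.transfer import TransferConfig

GB = 1024 ** 3

s3_client = boto3.client("s3")   # stands in for self.__s3 in the question
bucket = "my-bucket"             # placeholder
key_json = "exports/data.json"   # placeholder

# Serialize to JSON Lines as a str, then encode to bytes so the
# transfer manager can hash what it uploads.
json_bytes = df_copy.to_json(orient="records", lines=True).encode("utf-8")
bytes_buffer = BytesIO(json_bytes)

# Only switch to multipart above S3's 5 GB single-PUT limit.
upl_config = TransferConfig(multipart_threshold=5 * GB)

s3_client.upload_fileobj(
    bytes_buffer,
    Bucket=bucket,
    Key=key_json,
    Config=upl_config)

Note that to_json(...) plus .encode(...) briefly holds two full copies of the serialized data in memory; for payloads this large, writing the JSON to a temporary file and uploading it with upload_file would avoid that.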
