I'm attempting to upload a dataframe to S3. The dataframe is built by pulling in several data sources, joining them, and performing a few transformations, all fully in memory. I need to store the data in S3 where each line is a JSON record, like this:
{"key_1": "value_11", "key_2": "value_12", ...}
{"key_1": "value_21", "key_2": "value_22", ...}
...
I was using put_object() and had no issues until the tables got larger.
Code Snippet
from io import StringIO

from boto3.s3.transfer import TransferConfig

...
json_buffer = StringIO()
df_copy.to_json(json_buffer, orient="records", lines=True)
json_buffer.seek(0)

# self.__s3.put_object(
#     Bucket=bucket,
#     Body=json_buffer.getvalue(),
#     Key=key_json)

GB = 1024 ** 3
# Only use a multipart upload when the transfer exceeds S3's size
# limit for non-multipart uploads, which is 5 GB.
upl_config = TransferConfig(multipart_threshold=5 * GB)

self.__s3.upload_fileobj(
    json_buffer,
    Bucket=bucket,
    Key=key_json,
    Config=upl_config)
With the above code I get the following error:
TypeError: Unicode-objects must be encoded before hashing
I've tried a few other approaches without success.
Question:
Is there any way to save the data as JSON records rather than as a CSV, like in the other question?
CodePudding user response:
It looks like you need to pass in a buffer of bytes, not str. Try using BytesIO instead of StringIO.
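A minimal sketch of that change, assuming the same df_copy, bucket, and key_json as in the question (the plain boto3 client below is just a stand-in for self.__s3):

from io import BytesIO

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")  # stand-in for self.__s3 in the question

GB = 1024 ** 3
# Only trigger a multipart upload above S3's 5 GB non-multipart limit.
upl_config = TransferConfig(multipart_threshold=5 * GB)

# Serialize the dataframe to a JSON-lines str, then encode it to UTF-8
# bytes so upload_fileobj receives a binary file-like object.
json_bytes = df_copy.to_json(orient="records", lines=True).encode("utf-8")

with BytesIO(json_bytes) as json_buffer:
    s3.upload_fileobj(
        json_buffer,
        Bucket=bucket,
        Key=key_json,
        Config=upl_config)

The transfer machinery hashes the body as bytes while uploading, which is why a str-backed StringIO raises the "Unicode-objects must be encoded before hashing" error.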