I'm using Python 3.7 and trying to read a .dat file from AWS S3 and convert it to one or more CSV files based on certain logic. We're using the mdfreader library in Python.
import mdfreader
import pandas as pd

def convert_mdf_to_csvs(file_name, output_file_loc):
    yop = mdfreader.Mdf(file_name)
    yop.convert_to_pandas()
    # print(list(yop.keys()))
    # print([keys for keys in list(yop.keys()) if keys.endswith("group")])
    all_groups_keys = [keys for keys in list(yop.keys()) if keys.endswith("group")]
    for keys in all_groups_keys:
        print(yop[keys])
        timeframe = keys.split("group")[0]
        yop[keys].to_csv(str(output_file_loc) + str(timeframe) + ".csv")
The above code works fine on a local machine, but since AWS S3 is object storage, the read has to go through boto3. Due to the lack of documentation on the mdfreader side, I'm not sure how to pass that read stream into "yop = mdfreader.Mdf(file_name)"; the Mdf function seems to expect a full file path. I know I could copy the file to Lambda's /tmp and use it from there, but since that feels like a hack, I'd rather not do it.
I've searched quite a bit on SO Q&A but didn't find a clear answer for reading a .dat file from AWS S3.
Also, is there a better way to solve this, maybe using the standard csv library or something else?
Any help?
CodePudding user response:
The easiest method would be to use download_file() to download the file from Amazon S3 to /tmp/ on the local disk.
Then, you can use your existing code to process the file. This is definitely not a 'hack' -- it is a commonly used technique. It's certainly more reliable than streaming the file.
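Here is a minimal sketch of that approach. The bucket name, key, and Lambda handler are assumptions (they are not in the question); it simply reuses your existing convert_mdf_to_csvs() function on the downloaded copy.

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Hypothetical bucket and key -- in practice you would likely take
    # these from the S3 event notification that triggered the Lambda.
    bucket = "my-input-bucket"
    key = "data/input.dat"

    local_path = "/tmp/input.dat"

    # Download the object from S3 to Lambda's local /tmp storage
    s3.download_file(bucket, key, local_path)

    # Reuse the existing conversion logic unchanged; the resulting
    # CSVs are written to /tmp/ as well
    convert_mdf_to_csvs(local_path, "/tmp/")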
There is a limit on the amount of storage available in /tmp, and AWS Lambda containers can be reused, so either delete the temporary file after use, or use the same filename (e.g. /tmp/temp.dat) each time so that it overwrites the previous version.
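One way to apply that advice, again assuming the same hypothetical names (s3 client and convert_mdf_to_csvs from the sketch above):

import os

def process_with_cleanup(bucket, key):
    # Fixed filename, so a reused container simply overwrites the previous download
    local_path = "/tmp/temp.dat"
    s3.download_file(bucket, key, local_path)
    try:
        convert_mdf_to_csvs(local_path, "/tmp/")
    finally:
        # Explicitly free /tmp space in case the container is reused
        os.remove(local_path)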