I want to implement an AWS Lambda function that will execute the following Python script:
import os
import json
import pandas as pd

directory = os.fsencode(directory_in_string)

def transform_csv(csv):
    for file in os.listdir(directory):
        filename = os.fsdecode(file)
        d = open(r'C:\Users\r.reibold\Documents\GitHub\groovy_dynamodb_api\historische_wetterdaten\{}'.format(filename))
        data = json.load(d)
        df_historical = pd.json_normalize(data)
        # Transform Unix timestamps to datetime
        df_historical["dt"] = pd.to_datetime(df_historical["dt"], unit='s', errors='coerce').dt.strftime("%m/%d/%Y %H:%M:%S")
        df_historical["dt"] = pd.to_datetime(df_historical["dt"])
        ...
My question now is: how do I need to change the os calls so that they reference the S3 bucket instead of my local directory?
My first attempt looks like this:
DIRECTORY = 's3://weatherdata-templates/historische_wetterdaten/New/'
BUCKET = 'weatherdata-templates'

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=BUCKET, Prefix=DIRECTORY)

def lambda_handler(event, context):
    for page in pages:
        for obj in page['Contents']:
            filename = s3.fsdecode(obj)
            d = open(r's3://102135091842-weatherdata-templates/historische_wetterdaten/New/{}'.format(filename))
            data = json.load(d)
            df_historical = pd.json_normalize(data)
            ...
Am I on the right track or completely wrong? Thanks.
CodePudding user response:
Not quite there yet :)
Unfortunately, you can't do open(...) directly on an S3 URL, as it's not a file object. To load the object contents without storing the file locally, try using the S3 Boto3 resource, which provides higher-level access to the S3 SDK:

- Get the key of the object from obj['Key'] and create an Object resource for it.
- Use .get()['Body'] on that resource to get the contents as a StreamingBody.
- Call .read() on the StreamingBody to get the object as bytes, and decode it to a UTF-8 string (or whatever other encoding your files are in).
- Convert the JSON string to a Python object using json.loads(...).
import boto3

s3_resource = boto3.resource('s3')

...

def lambda_handler(event, context):
    for page in pages:
        for obj in page['Contents']:
            obj_reference = s3_resource.Object(BUCKET, obj['Key'])
            body = obj_reference.get()['Body'].read().decode('utf-8')
            data = json.loads(body)
            df_historical = pd.json_normalize(data)
            ...
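For completeness, here is a minimal end-to-end sketch of how the handler could look once the pieces are put together, including the datetime conversion from your original script. The bucket name, prefix, and the "dt" column are taken from the question and are assumptions about your setup, so adjust them as needed. Also note that Prefix for list_objects_v2 should be a plain key prefix, not an s3:// URL:

import json

import boto3
import pandas as pd

BUCKET = 'weatherdata-templates'            # from the question; adjust to your bucket
PREFIX = 'historische_wetterdaten/New/'     # a key prefix, not an s3:// URL

s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')

def lambda_handler(event, context):
    paginator = s3_client.get_paginator('list_objects_v2')

    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        # 'Contents' is missing when the prefix matches no objects
        for obj in page.get('Contents', []):
            # Read the object straight from S3 instead of open()-ing a local path
            body = s3_resource.Object(BUCKET, obj['Key']).get()['Body'].read().decode('utf-8')
            data = json.loads(body)

            df_historical = pd.json_normalize(data)
            # Same Unix-timestamp-to-datetime transformation as in the local script
            df_historical["dt"] = pd.to_datetime(df_historical["dt"], unit='s', errors='coerce')
            # ... rest of your processing

Keep in mind that pandas isn't included in the default Lambda runtime, so you'd need to bundle it with your deployment package or add it as a layer.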