I am trying to create a Lambda function that automatically cleans CSV files from an S3 bucket. The bucket receives new files every 5 minutes, so I have set up an S3 trigger for the Lambda function. To clean the CSV files I use the pandas library to create a DataFrame, and I have already installed a pandas layer. However, when creating the DataFrame I get an error. This is my code:
import json
import boto3
import pandas as pd
from io import StringIO

#create the s3 client
client = boto3.client('s3')

def lambda_handler(event, context):
    #define bucket_name and object_name from the event record
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    object_name = event['Records'][0]['s3']['object']['key']
    #create a df from the object
    df = pd.read_csv(object_name)
This is the error message:
[ERROR] FileNotFoundError: [Errno 2] No such file or directory: 'object_name'
On CloudWatch it additionally says:
OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k
Has anyone experienced the same issues? Thanks in advance for all your help!
CodePudding user response:
You have to use the S3 client to download the object from S3 before handing it to pandas. Something like:
response = client.get_object(Bucket=bucket_name, Key=object_name)
df = pd.read_csv(response["Body"])
You'll also have to make sure the Lambda function's execution role has permission to read from the bucket (s3:GetObject).
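For completeness, here is a minimal sketch of the whole handler along these lines; get_object and reading the streaming Body with pandas are standard boto3/pandas usage, and the unquote_plus call is there because S3 event notifications URL-encode object keys:

import boto3
import pandas as pd
from urllib.parse import unquote_plus

client = boto3.client('s3')

def lambda_handler(event, context):
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    # keys arrive URL-encoded in the event, so decode them first
    object_name = unquote_plus(event['Records'][0]['s3']['object']['key'])
    # download the object and feed its body (a file-like stream) to pandas
    response = client.get_object(Bucket=bucket_name, Key=object_name)
    df = pd.read_csv(response['Body'])
    # ... clean the dataframe here ...
    return {'statusCode': 200, 'rows': len(df)}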
CodePudding user response:
Change this line:
df = pd.read_csv("object_name")
to this:
df = pd.read_csv(object_name)
CodePudding user response:
Cause of error
object_name is just the key (a relative path) of the S3 object within its bucket, and it has no meaning on its own without bucket_name. pandas therefore treats it as a local file path, and since no such local file exists you get the FileNotFoundError.
Solution for the error
In order to refer to the S3 object properly, you have to construct the fully qualified S3 path from bucket_name and object_name. Also note that the object key arrives URL-quoted in the event, so you have to unquote it before building the path.
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    #define bucket_name and object_name from the event record
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    object_name = event['Records'][0]['s3']['object']['key']
    #build the fully qualified s3 path and create a df from the object
    filepath = f's3://{bucket_name}/{unquote_plus(object_name)}'
    df = pd.read_csv(filepath)
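Note that pd.read_csv only understands s3:// URLs if the s3fs package is importable in the Lambda runtime, so it needs to be bundled in your layer alongside pandas. The unquoting matters because the event notification encodes spaces and special characters in the key; a quick illustration:

from urllib.parse import unquote_plus

# S3 event notifications URL-encode keys ('+' for spaces, %XX escapes)
print(unquote_plus('my+report%282022%29.csv'))  # -> 'my report(2022).csv'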