I have an S3 bucket with a bunch of zip files. I want to decompress the zip files and, for each decompressed item, create an $item_file.csv.gz
and save it to another S3 bucket. I was thinking of creating a Glue job for it but I don't know where to begin. Any leads?
Eventually, I would like to terraform my solution, and it should be triggered whenever there are new files in the S3 bucket.
Would a Lambda function or any other service be more suited for this?
CodePudding user response:
From an architectural point of view, it depends on the size of your ZIP files: if processing a file takes less than 15 minutes, you can use a Lambda function.
Anything longer will hit the current 15-minute Lambda timeout, so you'd need a different solution (such as the Glue job you mentioned).
However, for your use case of triggering on new files, S3 event notifications let you trigger a Lambda function whenever objects are created in (or deleted from) the bucket.
I would recommend segregating the ZIP files into their own bucket; otherwise you'll also be paying for invocations that merely check whether an uploaded object sits in your specific "folder", since the Lambda is triggered for the entire bucket (the cost is negligible, but still worth pointing out). With a dedicated bucket, you know that any uploaded object is a ZIP file.
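When the Lambda fires, S3 passes the bucket name and object key in the event payload. A minimal handler sketch (the `process_zip` helper is hypothetical, standing in for the steps described below):

```python
import urllib.parse


def lambda_handler(event, context):
    # An S3 event notification can batch several records; handle each one.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the event (spaces become '+', etc.).
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        process_zip(bucket, key)  # hypothetical helper; see the sketches below
```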
Your Lambda can then download the file from S3 using `download_file` (example provided in the Boto3 documentation), unzip it using `zipfile`, and finally GZIP-compress each extracted file using `gzip`.
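A minimal sketch of that step, assuming the archives fit within Lambda's /tmp ephemeral storage (512 MB by default, configurable up to 10 GB); bucket and key names are whatever the event handed you:

```python
import gzip
import os
import shutil
import zipfile

import boto3

s3 = boto3.client("s3")


def unzip_and_gzip(source_bucket, key):
    """Download a ZIP from S3, extract it, and gzip each extracted file."""
    zip_path = os.path.join("/tmp", os.path.basename(key))
    extract_dir = "/tmp/extracted"
    os.makedirs(extract_dir, exist_ok=True)

    # Download the ZIP archive into Lambda's ephemeral /tmp storage.
    s3.download_file(source_bucket, key, zip_path)

    # Extract every member of the archive.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_dir)

    gzipped = []
    for name in os.listdir(extract_dir):
        item_path = os.path.join(extract_dir, name)
        gz_path = item_path + ".gz"  # e.g. item_file.csv becomes item_file.csv.gz
        # Stream the extracted file through gzip compression.
        with open(item_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        gzipped.append(gz_path)
    return gzipped
```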
You can then upload each output file to the new bucket using `upload_file` (also covered in the Boto3 documentation) and delete the original ZIP from the source bucket using `delete_object`.
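Continuing the sketch above, the upload-and-cleanup step might look like this (bucket names are placeholders):

```python
import os

import boto3

s3 = boto3.client("s3")


def upload_and_cleanup(gz_paths, dest_bucket, source_bucket, source_key):
    """Upload each gzipped file to the destination bucket, then delete the original ZIP."""
    for gz_path in gz_paths:
        # upload_file streams the local file to S3 under its base name.
        s3.upload_file(gz_path, dest_bucket, os.path.basename(gz_path))

    # Remove the original ZIP from the source bucket once everything is uploaded.
    s3.delete_object(Bucket=source_bucket, Key=source_key)
```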
Terraforming the above should also be relatively simple, as you'll mostly be using the `aws_lambda_function` and `aws_s3_bucket` resources, plus `aws_s3_bucket_notification` and `aws_lambda_permission` to wire up the trigger.
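A minimal sketch of that wiring, assuming the Lambda code is packaged as `lambda.zip` and an execution role (`aws_iam_role.lambda_exec`) is defined elsewhere; bucket names are placeholders:

```hcl
resource "aws_s3_bucket" "zip_source" {
  bucket = "my-zip-source-bucket" # placeholder name
}

resource "aws_s3_bucket" "gzip_destination" {
  bucket = "my-gzip-destination-bucket" # placeholder name
}

resource "aws_lambda_function" "unzipper" {
  function_name = "unzip-and-gzip"
  filename      = "lambda.zip"
  handler       = "handler.lambda_handler"
  runtime       = "python3.12"
  timeout       = 900 # the 15-minute maximum discussed above
  role          = aws_iam_role.lambda_exec.arn
}

# Allow S3 to invoke the function...
resource "aws_lambda_permission" "allow_s3" {
  statement_id  = "AllowS3Invoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.unzipper.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.zip_source.arn
}

# ...and fire it whenever a new .zip object lands in the source bucket.
resource "aws_s3_bucket_notification" "on_upload" {
  bucket = aws_s3_bucket.zip_source.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.unzipper.arn
    events              = ["s3:ObjectCreated:*"]
    filter_suffix       = ".zip"
  }

  depends_on = [aws_lambda_permission.allow_s3]
}
```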
Make sure your Lambda's execution role has the appropriate IAM policies to access both S3 buckets, and you should be good to go.