I have all my data available in the S3 location s3://sample/input_data.
I do my ETL by deploying AWS EMR and using PySpark.
The PySpark script is very simple (a rough sketch is included below):

- I load s3://sample/input_data as a Spark dataframe.
- Partition it by one column.
- Save the dataframe as Parquet files, with the write mode set to 'append', into the S3 location s3://sample/output_data.
- Then copy all files in s3://sample/input_data to s3://sample/archive_data and delete all data in s3://sample/input_data.

So when new data arrives in s3://sample/input_data, the job only processes the new files and saves them into s3://sample/output_data with partitioning.
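Roughly, the script looks like this (the input format, the partition column name, and the boto3-based archive/delete step are simplified placeholders):

```python
from pyspark.sql import SparkSession
import boto3

spark = SparkSession.builder.appName("simple-etl").getOrCreate()

# Load everything currently sitting in the input prefix
# (CSV is assumed here just for illustration)
df = spark.read.option("header", "true").csv("s3://sample/input_data/")

# Write partitioned Parquet, appending to whatever is already in the output
# ("partition_col" stands in for the real partition column)
(df.write
   .mode("append")
   .partitionBy("partition_col")
   .parquet("s3://sample/output_data/"))

# Last step: archive the processed files and empty the input prefix
s3 = boto3.resource("s3")
bucket = s3.Bucket("sample")
for obj in bucket.objects.filter(Prefix="input_data/"):
    archive_key = obj.key.replace("input_data/", "archive_data/", 1)
    s3.Object("sample", archive_key).copy_from(
        CopySource={"Bucket": "sample", "Key": obj.key})
    obj.delete()
```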
Is there any built-in mechanism that AWS EMR provides which I could use instead of doing that last step of my PySpark script?
CodePudding user response:
You could either use Delta Lake for those purposes, or partition your input directory by a time interval, e.g. s3://sample/input_data/year=2021/month=11/day=11/, so that each run only processes data from that time interval (see the sketch below).
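A minimal sketch of the second option, assuming a date-partitioned input layout (the run date, input format, and partition column are placeholders you would supply per run):

```python
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-etl").getOrCreate()

# In practice the run date would be passed in as a job argument;
# it is hard-coded here only to show the path layout.
run = date(2021, 11, 11)
input_path = f"s3://sample/input_data/year={run.year}/month={run.month}/day={run.day}/"

# Only the files under that one date prefix are read, so nothing
# needs to be copied or deleted after the job finishes.
df = spark.read.option("header", "true").csv(input_path)

(df.write
   .mode("append")
   .partitionBy("partition_col")
   .parquet("s3://sample/output_data/"))
```

With Delta Lake you would instead write the output as a Delta table and rely on its ACID appends (or MERGE for upserts), but the partitioned-input approach above needs no extra dependencies.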