I have all my data available in the S3 location s3://sample/input_data.
I do my ETL by deploying AWS EMR and using PySpark.
The PySpark script is very simple (a rough sketch is included below):

- I load s3://sample/input_data as a Spark dataframe.
- Partition it by one column.
- Save the dataframe as Parquet files, with the write mode set to 'append', into the S3 location s3://sample/output_data.
- Then copy all files in s3://sample/input_data to s3://sample/archive_data and delete all data in s3://sample/input_data.

So when new data arrives in s3://sample/input_data, the job only processes the new files and saves them into s3://sample/output_data with partitioning.
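Roughly, the script looks like this (the input format, the partition column name, and the boto3-based archive/delete step are simplified placeholders):

```python
from pyspark.sql import SparkSession
import boto3

spark = SparkSession.builder.appName("simple-etl").getOrCreate()

# Load everything currently sitting in the input prefix
# (CSV is assumed here just for illustration)
df = spark.read.option("header", "true").csv("s3://sample/input_data/")

# Write partitioned Parquet, appending to whatever is already in the output
# ("partition_col" stands in for the real partition column)
(df.write
   .mode("append")
   .partitionBy("partition_col")
   .parquet("s3://sample/output_data/"))

# Last step: archive the processed files and empty the input prefix
s3 = boto3.resource("s3")
bucket = s3.Bucket("sample")
for obj in bucket.objects.filter(Prefix="input_data/"):
    archive_key = obj.key.replace("input_data/", "archive_data/", 1)
    s3.Object("sample", archive_key).copy_from(
        CopySource={"Bucket": "sample", "Key": obj.key})
    obj.delete()
```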
Is there any built-in mechanism that AWS EMR provides which I could use instead of doing that last step of my PySpark script?
CodePudding user response:
You could either use Delta Lake for those purposes, or partition your input directory by a time interval, e.g. s3://sample/input_data/year=2021/month=11/day=11/, so that each run only processes data from that time interval (see the sketch below).
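A minimal sketch of the second option, assuming a date-partitioned input layout (the run date, input format, and partition column are placeholders you would supply per run):

```python
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-etl").getOrCreate()

# In practice the run date would be passed in as a job argument;
# it is hard-coded here only to show the path layout.
run = date(2021, 11, 11)
input_path = f"s3://sample/input_data/year={run.year}/month={run.month}/day={run.day}/"

# Only the files under that one date prefix are read, so nothing
# needs to be copied or deleted after the job finishes.
df = spark.read.option("header", "true").csv(input_path)

(df.write
   .mode("append")
   .partitionBy("partition_col")
   .parquet("s3://sample/output_data/"))
```

With Delta Lake you would instead write the output as a Delta table and rely on its ACID appends (or MERGE for upserts), but the partitioned-input approach above needs no extra dependencies.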