How to perform incremental load using AWS EMR (Pyspark) the right way?


I have all my data available in the S3 location s3://sample/input_data

I do my ETL by deploying AWS EMR and using PySpark.

The PySpark script is very simple:

  • Load s3://sample/input_data as a Spark DataFrame.
  • Partition it by one column.
  • Save the DataFrame as Parquet files, writing in 'append' mode to the S3 location s3://sample/output_data.
  • Copy all files from s3://sample/input_data to s3://sample/archive_data, then delete everything in s3://sample/input_data (see the sketch after this list).

So when new data arrives in s3://sample/input_data, the script only processes the new files and saves them, partitioned, into s3://sample/output_data.
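For context, a minimal sketch of the script described above could look like the following (the bucket layout is taken from the question; `partition_col`, the Parquet input format, and the app name are assumptions):

```python
import boto3
from pyspark.sql import SparkSession

# Bucket and prefixes taken from the question; "partition_col" is a
# placeholder for the actual partition column, and the input is assumed
# to be Parquet (the question does not state the format).
BUCKET = "sample"
INPUT_PREFIX = "input_data/"
ARCHIVE_PREFIX = "archive_data/"
OUTPUT_PATH = "s3://sample/output_data"

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

# Steps 1-3: load the new files, partition by one column, append to output.
df = spark.read.parquet(f"s3://{BUCKET}/{INPUT_PREFIX}")
df.write.partitionBy("partition_col").mode("append").parquet(OUTPUT_PATH)

# Step 4: archive the processed input files, then delete the originals.
s3 = boto3.resource("s3")
bucket = s3.Bucket(BUCKET)
for obj in bucket.objects.filter(Prefix=INPUT_PREFIX):
    bucket.copy({"Bucket": BUCKET, "Key": obj.key},
                obj.key.replace(INPUT_PREFIX, ARCHIVE_PREFIX, 1))
    obj.delete()
```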

Is there any built-in mechanism in AWS EMR that I should be aware of, and that I could use instead of the last step of my PySpark script?

CodePudding user response:

You could either use Delta Lake for this, or partition your input directory by a time interval, e.g. s3://sample/input_data/year=2021/month=11/day=11/, so that each run only processes data from that interval.
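As a sketch of the second suggestion, each run could build the input path from a run date and read only that day's partition (the hardcoded `run_date` is purely illustrative, and `partition_col` and the Parquet input format are assumptions):

```python
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-by-date").getOrCreate()

# In practice the run date would come from the scheduler; it is
# hardcoded here purely for illustration.
run_date = date(2021, 11, 11)
input_path = (
    "s3://sample/input_data/"
    f"year={run_date.year}/month={run_date.month:02d}/day={run_date.day:02d}/"
)

# Only that day's files are read, so nothing else is reprocessed.
df = spark.read.parquet(input_path)  # input format assumed to be Parquet
df.write.partitionBy("partition_col").mode("append") \
    .parquet("s3://sample/output_data")
```

With this layout the archive/delete step becomes unnecessary, since partitions that were already processed are simply never read again.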
