I have a Parquet dataset called customerActions. Every day I append about 1,000 rows to it using this syntax:
spark.sql('select * from customerActions').write.mode('append').parquet("/Staging/Mind/customerActions/")
Now I'm facing the following problem: reading this dataset takes a long time because the directory "/Staging/Mind/customerActions/" contains a huge number of small files, since every day I append only a small amount of data.
How can I make reading "/Staging/Mind/customerActions/" faster?
CodePudding user response:
One way to improve read speed is to compact the many small files into larger ones: read the existing path into a DataFrame, repartition it down to a small number of partitions, and write it back out. That should noticeably improve read performance.
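A minimal compaction sketch, assuming your paths and a target of 4 output files (pick a count that yields files of roughly 128 MB to 1 GB each); writing to a separate compacted path avoids reading from and overwriting the same location in one job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the many small Parquet files produced by the daily appends.
df = spark.read.parquet("/Staging/Mind/customerActions/")

# Rewrite as a few larger files; 4 is just an example value.
(df.repartition(4)
   .write
   .mode("overwrite")
   .parquet("/Staging/Mind/customerActions_compacted/"))

You could then point your readers at the compacted path, or swap it back into the original location once the rewrite has succeeded.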
CodePudding user response:
If you are OK with having multiple folders, you could use DataFrameWriter.partitionBy to group customerActions from, e.g., a certain week into one directory. I usually partition data by year/month/day, but any other criterion is also possible (see this example).
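A rough sketch of such a partitioned write; the eventTime column and the output path are assumptions, so substitute whatever timestamp or date column your data actually has:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/Staging/Mind/customerActions/")

# Derive partition columns from a hypothetical event timestamp column.
partitioned = (df
    .withColumn("year", F.year("eventTime"))
    .withColumn("month", F.month("eventTime"))
    .withColumn("day", F.dayofmonth("eventTime")))

# One subdirectory per year/month/day combination.
(partitioned.write
    .mode("append")
    .partitionBy("year", "month", "day")
    .parquet("/Staging/Mind/customerActions_partitioned/"))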
When you want to read the files back, you can then read just the subset of the data you need (Spark prunes the partitions you filter on) and/or parallelize the read, both of which should make it faster.
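For example, filtering on the partition columns from the sketch above (again assumed names) lets Spark list and scan only the matching directories instead of the whole dataset:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Only the year=2023/month=6 directories are listed and read.
june = (spark.read
    .parquet("/Staging/Mind/customerActions_partitioned/")
    .filter((F.col("year") == 2023) & (F.col("month") == 6)))

june.show()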