Should I drop unneeded columns from DataFrames before joining them in Spark?


Does it make sense to drop unneeded columns from DataFrames before joining them in Spark? For example: DF1 has 10 columns, DF2 has 15, and DF3 has 25. I want to join them, select the 10 columns I need, and save the result as Parquet.

Does it make sense to select only the needed columns from each DF before the join, or will the Spark engine optimize the join by itself and avoid carrying all 50 columns through the join operation?
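
For concreteness, a minimal sketch of the scenario I mean (the "id" join key, file paths, and column names are made up):

df1 = spark.read.parquet("df1.parquet")  # 10 columns
df2 = spark.read.parquet("df2.parquet")  # 15 columns
df3 = spark.read.parquet("df3.parquet")  # 25 columns

# join all three, then keep only the columns I actually need
joined = df1.join(df2, "id").join(df3, "id")
needed = ["id", "col_a", "col_b"]  # ...the 10 needed columns in total
joined.select(*needed).write.parquet("result.parquet")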

CodePudding user response:

Yes, it makes perfect sense, because it reduces the amount of data shuffled between executors. It is also best to select only the necessary columns as early as possible: in most cases, if the file format allows it (Parquet, Delta Lake), Spark will read data only for the necessary columns rather than for all of them. For example:

# `spark` is an existing SparkSession (e.g., in pyspark shell or a notebook)
df1 = spark.read.parquet("file1") \
  .select("col1", "col2", "col3")   # prune df1 right after the read
df2 = spark.read.parquet("file2") \
  .select("col1", "col5", "col6")   # prune df2 the same way
joined = df1.join(df2, "col1")      # the shuffle now moves only 6 columns
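
You can verify the column pruning yourself by inspecting the physical plan with explain(). The exact output varies by Spark version, but for Parquet sources the FileScan nodes should list only the pruned columns in ReadSchema, roughly like the sketch below:

joined.explain()
# Expect FileScan entries of the form (abbreviated, version-dependent):
#   FileScan parquet ... ReadSchema: struct<col1:string,col2:string,col3:string>
# showing that only the selected columns are read from disk, not all of them.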