import os
import glob
from pyspark.sql import functions as F
result = get_csv_files("/test")
d = {}
i = 1
# loop over the list of csv files and read each one into a DataFrame
for k, v in result.items():
    df = spark.read.option("header", True).csv(k)
    d[i] = df
    i = i + 1
Using the get_csv_files function I can read every CSV file in a folder and build a dictionary whose values are the resulting DataFrames.
I want to join all of these DataFrames on two columns that are present in every file ("colA" and "colB"). Is it possible to do this in a loop instead of hard-coding the joins like below?
df_merged = d[1].join(d[2], ["colA", "colB"], "outer").join(d[3], ["colA", "colB"], "outer") \
    .join(d[4], ["colA", "colB"], "outer").join(d[5], ["colA", "colB"], "outer")
CodePudding user response:
You can use Python's functools.reduce on the dict's values, as long as the join keys are the same for every DataFrame:
from functools import reduce
joined_sdf = reduce(lambda x, y: x.join(y, ['colA', 'colB'], 'outer'), d.values())
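For completeness, here is a minimal self-contained sketch of that approach, assuming the CSV files live under /test and a SparkSession is available; glob.glob stands in for the asker's get_csv_files helper, and the explicit loop at the end is the same fold written out long-hand:

import glob
from functools import reduce

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumption: the CSVs sit directly under /test; adjust the pattern to your layout.
paths = glob.glob("/test/*.csv")

# Read each CSV into its own DataFrame, using the header row for column names.
dfs = [spark.read.option("header", True).csv(p) for p in paths]

# Option 1: fold the list with reduce, outer-joining on the shared key columns.
joined_sdf = reduce(lambda left, right: left.join(right, ["colA", "colB"], "outer"), dfs)

# Option 2: the same fold as an explicit loop, no reduce needed.
df_merged = dfs[0]
for other in dfs[1:]:
    df_merged = df_merged.join(other, ["colA", "colB"], "outer")

Both forms build the same chain of outer joins; reduce is just the loop expressed as a one-liner.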