import os
import glob
from pyspark.sql import functions as F
result = get_csv_files("/test")
d = {}
i = 1
# loop over the list of csv files and read each one into a DataFrame
for k, v in result.items():
    df = spark.read.option("header", True).csv(k)
    d[i] = df
    i = i + 1
Using the get_csv_files function I can read every CSV file in a folder and build a dictionary whose values are the resulting DataFrames.
I want to join all of these DataFrames on two columns that are present in every file ("colA" and "colB"). Is it possible to do this in a loop instead of hard-coding the joins like below?
df_merged = d[1].join(d[2], ["colA", "colB"], "outer").join(d[3], ["colA", "colB"], "outer") \
    .join(d[4], ["colA", "colB"], "outer").join(d[5], ["colA", "colB"], "outer")
CodePudding user response:
You can use Python's functools.reduce on the dict's values, as long as the join keys are the same for every DataFrame:
from functools import reduce
joined_sdf = reduce(lambda x, y: x.join(y, ['colA', 'colB'], 'outer'), d.values())
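For completeness, here is a minimal self-contained sketch of that approach, assuming the CSV files live under /test and a SparkSession is available; glob.glob stands in for the asker's get_csv_files helper, and the explicit loop at the end is the same fold written out long-hand:

import glob
from functools import reduce

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumption: the CSVs sit directly under /test; adjust the pattern to your layout.
paths = glob.glob("/test/*.csv")

# Read each CSV into its own DataFrame, using the header row for column names.
dfs = [spark.read.option("header", True).csv(p) for p in paths]

# Option 1: fold the list with reduce, outer-joining on the shared key columns.
joined_sdf = reduce(lambda left, right: left.join(right, ["colA", "colB"], "outer"), dfs)

# Option 2: the same fold as an explicit loop, no reduce needed.
df_merged = dfs[0]
for other in dfs[1:]:
    df_merged = df_merged.join(other, ["colA", "colB"], "outer")

Both forms build the same chain of outer joins; reduce is just the loop expressed as a one-liner.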