How to merge multiple pyspark dataframes where all dataframes have same index columns?

Time:04-23

I created pyspark dataframes using below code:

df_1 = sqlContext.createDataFrame(df_1)
df_2 = sqlContext.createDataFrame(df_2)
df_3 = sqlContext.createDataFrame(df_3)
df_4 = sqlContext.createDataFrame(df_4)

Each of these dataframes has three common columns (Firm, Product, Date) which we can use as join keys, while the remaining columns differ. I want to merge all four dataframes.

df_1

Firm Product Date Parameter-A
Firm-A Product-1 30-01-2019 3

df_2

Firm Product Date Parameter-B
Firm-A Product-1 30-01-2019 4

df_3

Firm Product Date Parameter-C
Firm-A Product-1 30-01-2019 5

df_4

Firm Product Date Parameter-D
Firm-A Product-1 30-01-2019 6

I want the resulting dataframe to look like below.

df_result

Firm Product Date Parameter-A Parameter-B Parameter-C Parameter-D
Firm-A Product-1 30-01-2019 3 4 5 6

I tried the code below, but it only works for pandas dataframes.

import pandas as pd
from functools import reduce

df = [df_1, df_2, df_3, df_4]
df_result = reduce(lambda left, right: pd.merge(left, right, how='left', on=['Firm', 'Product']), df)
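
For reference, the pandas approach does produce the desired single-row result once Date is also included in the join keys (the sample rows above share the same date). A minimal runnable check built from the sample data shown above:

```python
import pandas as pd
from functools import reduce

keys = ['Firm', 'Product', 'Date']
row = ('Firm-A', 'Product-1', '30-01-2019')

# Rebuild the four sample frames, each with the shared keys plus one parameter
df_1 = pd.DataFrame([row + (3,)], columns=keys + ['Parameter-A'])
df_2 = pd.DataFrame([row + (4,)], columns=keys + ['Parameter-B'])
df_3 = pd.DataFrame([row + (5,)], columns=keys + ['Parameter-C'])
df_4 = pd.DataFrame([row + (6,)], columns=keys + ['Parameter-D'])

# Left-merge each frame onto the accumulated result on the shared key columns
df_result = reduce(lambda left, right: pd.merge(left, right, how='left', on=keys),
                   [df_1, df_2, df_3, df_4])
print(df_result)
```

This prints one row with all four parameter columns side by side, matching df_result in the question.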

CodePudding user response:

Chain the joins on all three shared key columns (Firm, Product, and Date):

merged_df = df_1.join(df_2, on=['Firm', 'Product', 'Date'], how='outer') \
    .join(df_3, on=['Firm', 'Product', 'Date'], how='outer') \
    .join(df_4, on=['Firm', 'Product', 'Date'], how='outer') \
    .na.drop()

merged_df.show()

display(merged_df)
