I am new to Pyspark so that is why I am stuck with the following:
I have 5 dataframes and each dataframes has the same Primary Key called concern_code. I need to outer join all this dataframes together and need to drop the 4 columns called concern_code from the 4 dataframes.
For example: Dataframe Df1 outer joins Df2 based on concern_code Dataframe Df1 outer joins Df3 based on concern_code and so on..
My current Pyspark syntax looks like this:
df1.join(df2,["concern_code"])\
.join(df3,df1["concern_code"] == df3["concern_code"])\
.join(df4,df1["concern_code"] == df4["concern_code"])\
.join(df5,df1["concern_code"] == df5["concern_code"])\
.show()
How do I need to fix the syntax to perform outer join and to have a final version of a new dataframe that has only one column of concern_code ?
CodePudding user response:
You are close. let's say you have following dfs:
d = [
("a", 5.2),
("b", 10.4),
("c", 7.8),
("d", 11.2),
]
df1 = spark.createDataFrame(d, ['concern_code','value'])
df2 = spark.createDataFrame(d, ['concern_code','value1'])
df3 = spark.createDataFrame(d, ['concern_code','value2'])
df4 = spark.createDataFrame(d, ['concern_code','value3'])
df5 = spark.createDataFrame(d, ['concern_code','value4'])
df1.show()
# output
------------ -----
|concern_code|value|
------------ -----
| a| 5.2|
| b| 10.4|
| c| 7.8|
| d| 11.2|
------------ -----
(
df1
.join(df2,on="concern_code", how="outer")
.join(df3,on="concern_code", how="outer")
.join(df4,on="concern_code", how="outer")
.join(df5,on="concern_code", how="outer")
.show()
)
# output
------------ ----- ------ ------ ------ ------
|concern_code|value|value1|value2|value3|value4|
------------ ----- ------ ------ ------ ------
| c| 7.8| 7.8| 7.8| 7.8| 7.8|
| d| 11.2| 11.2| 11.2| 11.2| 11.2|
| a| 5.2| 5.2| 5.2| 5.2| 5.2|
| b| 10.4| 10.4| 10.4| 10.4| 10.4|
------------ ----- ------ ------ ------ ------
CodePudding user response:
If you join two data frames on columns then the columns will be duplicated, as in your case. So I would suggest to use an array of strings, or just a string, i.e. 'id', for joining two or more data frames.
The code below should not duplicate the column names:
df1.join(df2,on='id', how='outer')\
.join(df3,on='id', how='outer')\
.join(df4,on='id', how='outer')\
.join(df5,on='id' how='outer')\
.show()