I want to join two datasets with desire out like below:
DATASET 1:
PIN LOCATION
1234 Germany
2356 Poland
2894 England
3452 Bloomberg
DATASET 2:
MAIL STARTLOCATION ENDLOCATION
[email protected] 1234 2894
[email protected] 2356 1234
[email protected] 3452 2894
OUTPUT SHOULD BE:
MAIL STARTLOCATION ENDLOCATION LOCATION1 LOCATION2
[email protected] 1234 2894 Germany England
[email protected] 2356 1234 poland Germany
[email protected] 3452 2894 Bloomberg England
Tried:
condi = [((df1.PIN == df2.STARTLOCATION) | (df1.PIN == df2.ENDLOCATION))]
joindata = df1.join(df2, on = condi, how = 'outer').select('*')
But it gives NULL
in LOCATION1
and LOCATION2
CodePudding user response:
You will have to join it two times to get the desired result because you are expecting two additional columns in your result.
df1.show()
---- ---------
| PIN| LOCATION|
---- ---------
|1234| Germany|
|2356| Poland|
|2894| England|
|3452|Bloomberg|
---- ---------
df2.show()
------------- ------------- -----------
| MAIL|STARTLOCATION|ENDLOCATION|
------------- ------------- -----------
| [email protected]| 1234| 2894|
| [email protected]| 2356| 1234|
|[email protected]| 3452| 2894|
------------- ------------- -----------
df2.join(df1, df2.STARTLOCATION==df1.PIN)\
.withColumnRenamed("LOCATION", "LOCATION1")\
.drop("PIN")\
.join(df1, df2.ENDLOCATION==df1.PIN)\
.withColumnRenamed("LOCATION", "LOCATION2")\
.drop("PIN")\
.show()
------------- ------------- ----------- --------- ---------
| MAIL|STARTLOCATION|ENDLOCATION|LOCATION1|LOCATION2|
------------- ------------- ----------- --------- ---------
|[email protected]| 3452| 2894|Bloomberg| England|
| [email protected]| 1234| 2894| Germany| England|
| [email protected]| 2356| 1234| Poland| Germany|
------------- ------------- ----------- --------- ---------
CodePudding user response:
If you don't mind using sql statement to achieve it, you can try:
df1.createOrReplaceTempView('tmp1')
df2.createOrReplaceTempView('tmp2')
sql = """
select a.MAIL,a.STARTLOCATION,a.ENDLOCATION,b.LOCATION as LOCATION1,c.LOCATION as LOCATION2
from tmp2 a join tmp1 b on a.STARTLOCATION=b.PIN
join tmp1 c on a.ENDLOCATION=c.PIN
"""
df = spark.sql(sql)
df.show(truncate=False)