Home > front end >  join pysprak data to desire output
join pysprak data to desire output

Time:01-04

I want to join two datasets with desire out like below:

DATASET 1:

PIN           LOCATION
1234          Germany
2356          Poland
2894          England
3452          Bloomberg

DATASET 2:

MAIL           STARTLOCATION      ENDLOCATION
[email protected]        1234             2894
[email protected]        2356             1234
[email protected]       3452             2894

OUTPUT SHOULD BE:

MAIL           STARTLOCATION      ENDLOCATION  LOCATION1       LOCATION2
[email protected]        1234             2894       Germany         England
[email protected]        2356             1234       poland          Germany
[email protected]       3452             2894      Bloomberg        England

Tried:

condi = [((df1.PIN == df2.STARTLOCATION) | (df1.PIN == df2.ENDLOCATION))]

joindata = df1.join(df2, on = condi, how = 'outer').select('*')

But it gives NULL in LOCATION1 and LOCATION2

CodePudding user response:

You will have to join it two times to get the desired result because you are expecting two additional columns in your result.

df1.show()
 ---- --------- 
| PIN| LOCATION|
 ---- --------- 
|1234|  Germany|
|2356|   Poland|
|2894|  England|
|3452|Bloomberg|
 ---- --------- 

df2.show()
 ------------- ------------- ----------- 
|         MAIL|STARTLOCATION|ENDLOCATION|
 ------------- ------------- ----------- 
| [email protected]|         1234|       2894|
| [email protected]|         2356|       1234|
|[email protected]|         3452|       2894|
 ------------- ------------- ----------- 

df2.join(df1, df2.STARTLOCATION==df1.PIN)\
    .withColumnRenamed("LOCATION", "LOCATION1")\
    .drop("PIN")\
    .join(df1, df2.ENDLOCATION==df1.PIN)\
    .withColumnRenamed("LOCATION", "LOCATION2")\
    .drop("PIN")\
    .show()
 ------------- ------------- ----------- --------- --------- 
|         MAIL|STARTLOCATION|ENDLOCATION|LOCATION1|LOCATION2|
 ------------- ------------- ----------- --------- --------- 
|[email protected]|         3452|       2894|Bloomberg|  England|
| [email protected]|         1234|       2894|  Germany|  England|
| [email protected]|         2356|       1234|   Poland|  Germany|
 ------------- ------------- ----------- --------- --------- 

CodePudding user response:

If you don't mind using sql statement to achieve it, you can try:

df1.createOrReplaceTempView('tmp1')
df2.createOrReplaceTempView('tmp2')
sql = """
    select a.MAIL,a.STARTLOCATION,a.ENDLOCATION,b.LOCATION as LOCATION1,c.LOCATION as LOCATION2
    from tmp2 a join tmp1 b on a.STARTLOCATION=b.PIN
        join tmp1 c on a.ENDLOCATION=c.PIN
"""
df = spark.sql(sql)
df.show(truncate=False)
  •  Tags:  
  • Related