String matching between two pyspark dataframe columns-CodePudding

I have a dataframe cities

country      cities
  UK        [London,London Luton, Luton]
  UK        [London,London Gatwick, Gatwick]

and reference dataframe airports:

city         airport            coords
London        London Luton       12.51
London        London Gatwick     100.32

I want to match list of values in cities column with airport column from reference df. If it matches, get the relevance airport names and co ordinates from the reference df.

Sample of the desired output of cities df:

country      cities                                airport            coords
  UK        [London,London Luton, Luton]           London Luton       12.51
  UK        [London,London Gatwick, Gatwick]       London Gatwick     100.32

Explanation:

[London,**London Luton**, Luton] from cities matches with **London Luton** in airport

I have explored some options but couldn't really get there. Can anyone help? Thanks

CodePudding user response：

You can utilise array_contains to mark rows which contain your search string and finally filter on the rows which returns True

Data Preparation

d1 = {
    'cities':[
                 ['London','London Luton', 'Luton'],
                 ['London','London Gatwick', 'Gatwick']
            ],
    'country':['UK','UK']
    
    
}

d2 = {
    'country':['UK','UK'],
    'city':['London','London'],
    'airport':['London Luton','London Gatwick'],
    'coords':[12.51,100.32]
}


sparkDF1 = sql.createDataFrame(pd.DataFrame(d1))
sparkDF2 = sql.createDataFrame(pd.DataFrame(d2))

sparkDF1.show(truncate=False)

 --------------------------------- ------- 
|cities                           |country|
 --------------------------------- ------- 
|[London, London Luton, Luton]    |UK     |
|[London, London Gatwick, Gatwick]|UK     |
 --------------------------------- ------- 

sparkDF2.show()

 ------- ------ -------------- ------ 
|country|  city|       airport|coords|
 ------- ------ -------------- ------ 
|     UK|London|  London Luton| 12.51|
|     UK|London|London Gatwick|100.32|
 ------- ------ -------------- ------

Array Contains

finalDF = sparkDF1.join(sparkDF2
                       ,sparkDF1['country'] == sparkDF2['country']
                       ,'inner'
                    ).select(sparkDF1['*'],sparkDF2['airport'])


finalDF = finalDF.withColumn('flag',F.array_contains( F.col('cities'),F.col('airport') ) )


finalDF.filter(F.col('flag') == True).show(truncate=False)

 --------------------------------- ------- -------------- ---- 
|cities                           |country|airport       |flag|
 --------------------------------- ------- -------------- ---- 
|[London, London Luton, Luton]    |UK     |London Luton  |true|
|[London, London Gatwick, Gatwick]|UK     |London Gatwick|true|
 --------------------------------- ------- -------------- ----

CodePudding user response：

You could create a new column on the cities table with the airport name. Then you could simply merge the two tables on the airport column.

Using the prep core from @Vaebhav:

sparkDF1 = sparkDF1.withColumn("airport", F.col("cities")[1])

sparkDF1.show(truncate=False)

 --------------------------------- ------- -------------- 
|cities                           |country|airport       |
 --------------------------------- ------- -------------- 
|[London, London Luton, Luton]    |UK     |London Luton  |
|[London, London Gatwick, Gatwick]|UK     |London Gatwick|
 --------------------------------- ------- -------------- 


finalDF = sparkDF1.join(sparkDF2, on="airport", how="right")

finalDF.show(truncate=False)

 -------------- --------------------------------- ------- ------- ------ ------ 
|airport       |cities                           |country|country|city  |coords|
 -------------- --------------------------------- ------- ------- ------ ------ 
|London Luton  |[London, London Luton, Luton]    |UK     |UK     |London|12.51 |
|London Gatwick|[London, London Gatwick, Gatwick]|UK     |UK     |London|100.32|
 -------------- --------------------------------- ------- ------- ------ ------