I have a dataframe cities
country cities
UK [London,London Luton, Luton]
UK [London,London Gatwick, Gatwick]
and reference dataframe airports:
city airport coords
London London Luton 12.51
London London Gatwick 100.32
I want to match list of values in cities column with airport column from reference df. If it matches, get the relevance airport names and co ordinates from the reference df.
Sample of the desired output of cities df:
country cities airport coords
UK [London,London Luton, Luton] London Luton 12.51
UK [London,London Gatwick, Gatwick] London Gatwick 100.32
Explanation:
[London,**London Luton**, Luton]
from cities matches with **London Luton**
in airport
I have explored some options but couldn't really get there. Can anyone help? Thanks
CodePudding user response:
You can utilise array_contains to mark rows which contain your search string and finally filter on the rows which returns True
Data Preparation
d1 = {
'cities':[
['London','London Luton', 'Luton'],
['London','London Gatwick', 'Gatwick']
],
'country':['UK','UK']
}
d2 = {
'country':['UK','UK'],
'city':['London','London'],
'airport':['London Luton','London Gatwick'],
'coords':[12.51,100.32]
}
sparkDF1 = sql.createDataFrame(pd.DataFrame(d1))
sparkDF2 = sql.createDataFrame(pd.DataFrame(d2))
sparkDF1.show(truncate=False)
--------------------------------- -------
|cities |country|
--------------------------------- -------
|[London, London Luton, Luton] |UK |
|[London, London Gatwick, Gatwick]|UK |
--------------------------------- -------
sparkDF2.show()
------- ------ -------------- ------
|country| city| airport|coords|
------- ------ -------------- ------
| UK|London| London Luton| 12.51|
| UK|London|London Gatwick|100.32|
------- ------ -------------- ------
Array Contains
finalDF = sparkDF1.join(sparkDF2
,sparkDF1['country'] == sparkDF2['country']
,'inner'
).select(sparkDF1['*'],sparkDF2['airport'])
finalDF = finalDF.withColumn('flag',F.array_contains( F.col('cities'),F.col('airport') ) )
finalDF.filter(F.col('flag') == True).show(truncate=False)
--------------------------------- ------- -------------- ----
|cities |country|airport |flag|
--------------------------------- ------- -------------- ----
|[London, London Luton, Luton] |UK |London Luton |true|
|[London, London Gatwick, Gatwick]|UK |London Gatwick|true|
--------------------------------- ------- -------------- ----
CodePudding user response:
You could create a new column on the cities table with the airport name.
Then you could simply merge the two tables on the airport
column.
Using the prep core from @Vaebhav:
sparkDF1 = sparkDF1.withColumn("airport", F.col("cities")[1])
sparkDF1.show(truncate=False)
--------------------------------- ------- --------------
|cities |country|airport |
--------------------------------- ------- --------------
|[London, London Luton, Luton] |UK |London Luton |
|[London, London Gatwick, Gatwick]|UK |London Gatwick|
--------------------------------- ------- --------------
finalDF = sparkDF1.join(sparkDF2, on="airport", how="right")
finalDF.show(truncate=False)
-------------- --------------------------------- ------- ------- ------ ------
|airport |cities |country|country|city |coords|
-------------- --------------------------------- ------- ------- ------ ------
|London Luton |[London, London Luton, Luton] |UK |UK |London|12.51 |
|London Gatwick|[London, London Gatwick, Gatwick]|UK |UK |London|100.32|
-------------- --------------------------------- ------- ------- ------ ------