For the two dataframes below, I'm trying to see if name1
in df1
contains name2
in df2
. How to achieve this? I'm thinking of doing a join but there's no join key. Can I specify something like if name2 contains name1 then join them
, and result in the output below? Many thanks for your help
df1:
Name1 Colour
Lisa ('blue','yellow')
Bart ('pink', 'yellow')
Homer ('green', 'black')
Maggie ('yellow','orange','blue')
df2
Name2 Age
Lisa S. 5
Bart Simpsons 8
Bob 10
Ben 30
expected output:
Name2 Age Name1 colour
Lisa S. 5 Lisa ('blue','yellow')
Bart Simpsons 8. Bart ('pink', 'yellow')
CodePudding user response:
Update
As @samkart mentioned, we can use direct .crossJoin()
. Updated the solution with that.
As there is no common key to join, you may have to perform cross join and compare each value against the rest. This can be done by introducing a "dummy_key". Then just filter the rows by string contains()
:
df1 = spark.createDataFrame([["Lisa","('blue','yellow')"], ["Bart","('pink', 'yellow')"], ["Homer","('green', 'black')"], ["Maggie","('yellow','orange','blue')"]], ["Name1","Colour"])
df2 = spark.createDataFrame([["Lisa S.",5], ["Bart Simpsons",8], ["Bob",10], ["Ben",30]], ["Name2","Age"])
result_df = df2.crossJoin(df1)
result_df = result_df.filter(F.col("Name2").contains(F.col("Name1")))
[Out]:
------------- --- ----- ------------------
| Name2|Age|Name1| Colour|
------------- --- ----- ------------------
| Lisa S.| 5| Lisa| ('blue','yellow')|
|Bart Simpsons| 8| Bart|('pink', 'yellow')|
------------- --- ----- ------------------