Pyspark- how to check one data frame column contains string from another dataframe

For the two dataframes below, I'm trying to check whether Name2 in df2 contains Name1 from df1. How can I achieve this? I'm thinking of a join, but there is no join key. Can I specify a condition like "if Name2 contains Name1, join them" and produce the output below? Many thanks for your help.

df1:
Name1   Colour
Lisa    ('blue','yellow')
Bart    ('pink', 'yellow')
Homer   ('green', 'black')
Maggie  ('yellow','orange','blue')

df2:
Name2          Age
Lisa S.        5
Bart Simpsons  8
Bob            10
Ben            30

expected output:
Name2          Age  Name1   Colour
Lisa S.        5    Lisa    ('blue','yellow')
Bart Simpsons  8    Bart    ('pink', 'yellow')

CodePudding user response:

Update: as @samkart mentioned, we can use .crossJoin() directly; the solution below has been updated accordingly.


As there is no common key to join on, you have to perform a cross join and compare each value against the rest (the original version of this answer did this by introducing a dummy join key, but a direct .crossJoin() is simpler). Then just filter the rows with the string contains() method:

from pyspark.sql import functions as F

df1 = spark.createDataFrame([["Lisa","('blue','yellow')"], ["Bart","('pink', 'yellow')"], ["Homer","('green', 'black')"], ["Maggie","('yellow','orange','blue')"]], ["Name1","Colour"])
df2 = spark.createDataFrame([["Lisa S.",5], ["Bart Simpsons",8], ["Bob",10], ["Ben",30]], ["Name2","Age"])

# Pair every row of df2 with every row of df1, then keep only the pairs
# where Name2 contains Name1 as a substring.
result_df = df2.crossJoin(df1)
result_df = result_df.filter(F.col("Name2").contains(F.col("Name1")))
result_df.show()

[Out]:
+-------------+---+-----+------------------+
|        Name2|Age|Name1|            Colour|
+-------------+---+-----+------------------+
|      Lisa S.|  5| Lisa| ('blue','yellow')|
|Bart Simpsons|  8| Bart|('pink', 'yellow')|
+-------------+---+-----+------------------+
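
As a variant (a minimal sketch, not part of the original answer): join() also accepts an arbitrary boolean expression, so the contains() test can be passed directly as the join condition instead of filtering after a cross join. And since plain substring matching can over-match (for example, "Ben" would also match a hypothetical "Benedict S."), a stricter comparison against the first whitespace-separated token of Name2 is sketched as well, assuming Name1 always corresponds to the first word of Name2.

# Variant 1: non-equi join condition; defaults to an inner join and
# returns the same rows as crossJoin + filter.
result_df = df2.join(df1, F.col("Name2").contains(F.col("Name1")))

# Variant 2 (assumption: Name1 matches the first word of Name2):
# compare against the first whitespace-separated token to avoid
# accidental substring hits.
first_token = F.split(F.col("Name2"), " ").getItem(0)
result_df_strict = df2.crossJoin(df1).filter(first_token == F.col("Name1"))
result_df_strict.show()

Note that Spark still plans the non-equi join as a cartesian product, so performance is comparable to the explicit cross join; on Spark versions before 3.0 it may also require setting spark.sql.crossJoin.enabled=true.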