Filter pyspark DataFrame by string match

Time:04-11

I would like to check for a substring match between the comments and keywords columns, and flag whether any of the keywords is present in that row's comment.

Input:

   name               comments                keywords
0  paul      account is active  active,activated,activ
1  john   account is activated  active,activated,activ
2   max  account is activateds  active,activated,activ

Expected output:

match 
True
True
True

CodePudding user response:

The most efficient approach here is to loop over the rows. To match whole words, you can use set intersection:

# True when at least one whitespace-separated word equals one of the keywords
df['match'] = [bool(set(c.split()) & set(k.split(',')))
               for c, k in zip(df['comments'], df['keywords'])]

Output (row 2 is False because "activateds" is not an exact match for any keyword):

   name               comments                keywords  match
0  paul      account is active  active,activated,activ   True
1  john   account is activated  active,activated,activ   True
2   max  account is activateds  active,activated,activ  False
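
To make the word-level behaviour concrete, here is a minimal sketch in plain Python without pandas. The helper name `has_word_match` is hypothetical and only used for illustration. It assumes comments are split on whitespace and keywords on commas, as in the answer above.

```python
def has_word_match(comment: str, keywords: str) -> bool:
    """Return True if any whole word of `comment` equals one of the keywords."""
    words = set(comment.split())      # whitespace-separated words
    keys = set(keywords.split(","))   # comma-separated keywords
    return not words.isdisjoint(keys)


# Same rows as in the question: (comment, keywords)
rows = [
    ("account is active", "active,activated,activ"),
    ("account is activated", "active,activated,activ"),
    ("account is activateds", "active,activated,activ"),
]

word_matches = [has_word_match(c, k) for c, k in rows]
print(word_matches)  # [True, True, False]
```

The last row is False because "activateds" is compared as a whole word, and no keyword is exactly equal to it.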

Used input:

import pandas as pd

df = pd.DataFrame({'name': ['paul', 'john', 'max'],
                   'comments': ['account is active', 'account is activated', 'account is activateds'],
                   'keywords': ['active,activated,activ', 'active,activated,activ', 'active,activated,activ']})

With a minor variation, you can check for a substring match instead ("activ" matches "activateds"):

df['substring'] = [any(w in c for w in k.split(','))
                   for c,k in zip(df['comments'], df['keywords'])]

Output:

   name               comments                keywords  substring
0  paul      account is active  active,activated,activ       True
1  john   account is activated  active,activated,activ       True
2   max  account is activateds  active,activated,activ       True
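
One pitfall with the substring check: an empty keyword (for example from a trailing comma, as in "active,") is a substring of every string, so the row would always be True. The following is a rough sketch that guards against this. The helper name `has_substring_match` is hypothetical, and the whitespace stripping and case-insensitive comparison are assumptions that go beyond the original question.

```python
def has_substring_match(comment: str, keywords: str) -> bool:
    """Return True if any non-empty keyword occurs anywhere in `comment`."""
    # Assumption: surrounding whitespace and letter case should be ignored.
    keys = [k.strip().lower() for k in keywords.split(",")]
    keys = [k for k in keys if k]  # drop empty keywords such as from "active,"
    text = comment.lower()
    return any(k in text for k in keys)


print(has_substring_match("account is activateds", "active,activated,activ"))  # True
print(has_substring_match("account closed", "active,"))                        # False
```

Inside the list comprehension above, you could call this helper in place of the inline `any(...)` expression.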

CodePudding user response:

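If every row has the same keywords, you can build one regular expression from the first row's keywords and use str.contains. Escaping the keywords keeps any regex metacharacters literal, and leaving out capture groups avoids the pandas warning about match groups:

import re

keys = '|'.join(re.escape(x) for x in df['keywords'].iloc[0].split(','))
df['comments'].str.contains(keys)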

Output:

0    True
1    True
2    True
Name: comments, dtype: bool
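
Note that this pattern is built only from the first row's keywords. If the keywords can differ from row to row, one possible approach is to build the pattern per row. The sketch below assumes that case; the helper name `row_matches` is hypothetical, and returning False for an empty keyword list is a design choice rather than something stated in the question.

```python
import re

import pandas as pd


def row_matches(comment: str, keywords: str) -> bool:
    """Return True if any comma-separated keyword occurs in `comment`."""
    # Escape each keyword so characters like "." or "+" are matched literally.
    parts = [re.escape(k) for k in keywords.split(",") if k]
    if not parts:
        # An empty pattern would match every string, so treat it as no match.
        return False
    return re.search("|".join(parts), comment) is not None


df = pd.DataFrame({
    "name": ["paul", "john", "max"],
    "comments": ["account is active", "account is activated", "account is activateds"],
    "keywords": ["active,activated,activ"] * 3,
})

# Evaluate each comment against the keywords from its own row.
df["match"] = [row_matches(c, k) for c, k in zip(df["comments"], df["keywords"])]
print(df["match"].tolist())  # [True, True, True]
```

Because of `re.escape`, a keyword such as "3.5" matches only the literal text "3.5" and not "345".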