Show which strings are contained in a column

Time:10-11

How can I create a column called "reason" that shows which strings from match_d matched in each row?

match_d = {"col_a":["green", "purple"], "col_b":["weak", "stro", "strong"],...}
df
    fruit   col_a     col_b
0   apple   yellow     NaN
1   pear    blue       NaN
2   banana  green      strong
3   cherry  green      heavy
4   grapes  brown      light
...

Expected Output

    fruit   col_a     col_b      reason
0   apple   yellow     NaN        NaN
1   pear    blue       NaN        NaN
2   banana  green      strong     col_a:["green"], col_b:["stro", "strong"]
3   cherry  green      heavy      col_a:["green"] 
4   grapes  brown      light       

CodePudding user response:

Use a nested list comprehension to collect, for each column in match_d, the strings that occur as substrings of each value (joined by commas), then join the non-empty results with their column names:

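import numpy as np
import pandas as pd

match_d = {"col_a":["green", "purple"], "col_b":["weak", "stro", "strong"]}

cols = list(match_d.keys())
# For each column, build one comma-joined string of matched substrings per row
# ('' when nothing matches or the value is NaN).
L = [[','.join(z for z in match_d[x] if pd.notna(y) and z in y)
      for y in df[x]] for x in cols]

# zip(*L) regroups the per-column lists by row; rows without any match get NaN.
df['reason'] = [np.nan if ''.join(x) == '' else ';'.join(f'{a}:[{b}]'
                for a, b in zip(cols, x) if b != '')
                for x in zip(*L)]
print(df)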
    fruit   col_a   col_b                             reason
0   apple  yellow     NaN                                NaN
1    pear    blue     NaN                                NaN
2  banana   green  strong  col_a:[green];col_b:[stro,strong]
3  cherry   green   heavy                      col_a:[green]
4  grapes   brown   light                                NaN
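
The same logic can be easier to follow when split into small functions. The following is only a sketch: the sample frame is rebuilt from the five rows shown in the question, and the helper names matched_substrings and build_reason are made up for illustration. It uses a row-wise apply, which is usually slower than the list comprehension on large frames.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (only the five rows shown there).
df = pd.DataFrame({
    "fruit": ["apple", "pear", "banana", "cherry", "grapes"],
    "col_a": ["yellow", "blue", "green", "green", "brown"],
    "col_b": [np.nan, np.nan, "strong", "heavy", "light"],
})
match_d = {"col_a": ["green", "purple"], "col_b": ["weak", "stro", "strong"]}


def matched_substrings(value, candidates):
    """Return the candidates that occur as substrings of value (hypothetical helper)."""
    if pd.isna(value):
        return []
    return [c for c in candidates if c in value]


def build_reason(row):
    """Join per-column matches as 'col:[a,b]'; return NaN when nothing matched."""
    parts = []
    for col, candidates in match_d.items():
        hits = matched_substrings(row[col], candidates)
        if hits:
            parts.append(f"{col}:[{','.join(hits)}]")
    return ";".join(parts) if parts else np.nan


df["reason"] = df.apply(build_reason, axis=1)
print(df)
```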

Alternative solution with .apply:

match_d = {"col_a":["green", "purple"], "col_b":["weak", "stro", "strong"]}

cols = list(match_d.keys())
# x.name is the column label, so each column is checked against its own list in match_d.
df1 = df[cols].apply(lambda x: [','.join(z for z in match_d[x.name]
                                         if pd.notna(y) and z in y) for y in x])

# Each row of df1 holds the comma-joined matches per column ('' if none).
df['reason'] = [np.nan if ''.join(x) == '' else ';'.join(f'{a}:[{b}]'
                for a, b in zip(cols, x) if b != '')
                for x in df1.to_numpy()]
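
Both snippets write the matches as col_a:[green];col_b:[stro,strong], while the question's expected output quotes each string and separates the columns with a comma and a space. If that exact layout matters, one possible variation is sketched below. It assumes that json.dumps is an acceptable way to render the quoted list; the function name quoted_reason is hypothetical, and the sample frame is again rebuilt from the question.

```python
import json

import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (only the five rows shown there).
df = pd.DataFrame({
    "fruit": ["apple", "pear", "banana", "cherry", "grapes"],
    "col_a": ["yellow", "blue", "green", "green", "brown"],
    "col_b": [np.nan, np.nan, "strong", "heavy", "light"],
})
match_d = {"col_a": ["green", "purple"], "col_b": ["weak", "stro", "strong"]}


def quoted_reason(row):
    """Format matches like col_a:["green"], col_b:["stro", "strong"] (hypothetical helper)."""
    parts = []
    for col, candidates in match_d.items():
        value = row[col]
        # Skip NaN values, then keep candidates that are substrings of the value.
        hits = [c for c in candidates if pd.notna(value) and c in value]
        if hits:
            # json.dumps renders the list with double-quoted items.
            parts.append(f"{col}:{json.dumps(hits)}")
    return ", ".join(parts) if parts else np.nan


df["reason"] = df.apply(quoted_reason, axis=1)
print(df["reason"].tolist())
```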