Check if column contains substring from column

Time:10-05

Consider the following dataframe:

import pandas as pd

df = pd.DataFrame(data={'String': list('abcde'), 'Fruit': ['apple', 'pineapple', 'banana', 'orange', 'kiwi']})
  String      Fruit
0      a      apple
1      b  pineapple
2      c     banana
3      d     orange
4      e       kiwi

For each fruit, I want to check whether it contains each of the substrings in the String column. The outcome should look something like this:

               a      b      c      d      e
Fruit                                       
apple       True  False  False  False   True
pineapple   True  False  False  False   True
banana      True   True  False  False  False
orange      True  False  False  False   True
kiwi       False  False  False  False  False

Is this possible with a built-in pandas method? My attempt uses a for loop, shown below, but it may become slow as the number of strings and fruits grows.

df_new = pd.DataFrame(index=df.index)
for i in range(len(df)):
    # One boolean column per substring
    df_new[df.String[i]] = df.Fruit.str.contains(df.String[i])
df_new = df_new.set_index(df.Fruit)  # set_index returns a new frame

Answer:

Here is a loop-free solution: collect every match of the String column values with str.findall, pass the resulting lists to MultiLabelBinarizer, then reindex to add any missing columns, filled with False:

pat = '|'.join(df['String'])
s = df['Fruit'].str.findall(pat)

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df1 = (pd.DataFrame(mlb.fit_transform(s),
                  columns=mlb.classes_,
                  index=df['Fruit'])
        .astype(bool)
        .reindex(df['String'], fill_value=False, axis=1))
print(df1)
String         a      b      c      d      e
Fruit                                       
apple       True  False  False  False   True
pineapple   True  False  False  False   True
banana      True   True  False  False  False
orange      True  False  False  False   True
kiwi       False  False  False  False  False
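
The pattern above joins the raw strings with '|', so each string is interpreted as a regular expression. That is fine for plain letters, but if the String column could contain characters such as '.' or '+', escaping each value first is safer. A minimal sketch, assuming hypothetical data (df_demo, pat_safe and found are illustrative names, not part of the original answer):

```python
import re
import pandas as pd

# Hypothetical data: '.' and '+' are regex metacharacters.
df_demo = pd.DataFrame({'String': ['.', '+', 'a'],
                        'Fruit': ['a.b', 'c+d', 'xyz']})

# Escape each literal before joining, so '.' matches only a real dot.
pat_safe = '|'.join(re.escape(s) for s in df_demo['String'])

# Each row becomes the list of literal strings found in that fruit.
found = df_demo['Fruit'].str.findall(pat_safe)
print(found.tolist())
```

The same escaped pattern can be used in place of pat in any of the solutions here.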

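A pandas-only alternative uses str.get_dummies:

pat = '|'.join(df['String'])
# Assign to a new name so the original df is not overwritten
df2 = (df['Fruit'].str.findall(pat)
                  .str.join('|')
                  .str.get_dummies()
                  .astype(bool)
                  .reindex(df['String'], fill_value=False, axis=1)
                  .set_index(df['Fruit']))
print(df2)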
String         a      b      c      d      e
Fruit                                       
apple       True  False  False  False   True
pineapple   True  False  False  False   True
banana      True   True  False  False  False
orange      True  False  False  False   True
kiwi       False  False  False  False  False

Another solution, using str.extractall with pd.get_dummies:

pat = r'({})'.format('|'.join(df['String']))

df1 = (pd.get_dummies(df['Fruit'].str.extractall(pat)[0], dtype=bool)
         .groupby(level=0).max()
         .reindex(index=df.index, columns=df['String'], fill_value=False)
         .set_index(df['Fruit'])
         .rename_axis(None, axis=1))

print(df1)
               a      b      c      d      e
Fruit                                       
apple       True  False  False  False   True
pineapple   True  False  False  False   True
banana      True   True  False  False  False
orange      True  False  False  False   True
kiwi       False  False  False  False  False
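
To sanity-check any of these results, a plain-Python baseline can help. The sketch below is not one of the answers above and still loops at Python level, so it is not claimed to be faster. It uses the `in` operator, which does a literal substring test. That matters for multi-character strings: findall and extractall return non-overlapping matches, so overlapping strings (for example 'ap' and 'pp' in 'apple') may not all be reported by the regex approaches. The names df_check and baseline are illustrative.

```python
import pandas as pd

df_check = pd.DataFrame({'String': list('abcde'),
                         'Fruit': ['apple', 'pineapple', 'banana', 'orange', 'kiwi']})

# One row per fruit, one column per string; `in` is a literal substring test.
rows = [[s in fruit for s in df_check['String']] for fruit in df_check['Fruit']]

baseline = pd.DataFrame(rows,
                        index=pd.Index(df_check['Fruit'], name='Fruit'),
                        columns=list(df_check['String']))
print(baseline)
```

Comparing a vectorized result against it, for instance with df1.equals(baseline), is one way to confirm that both agree on a given dataset.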