Consider the following dataframe:
import pandas as pd

df = pd.DataFrame(data={'String': list('abcde'), 'Fruit': ['apple', 'pineapple', 'banana', 'orange', 'kiwi']})
  String      Fruit
0      a      apple
1      b  pineapple
2      c     banana
3      d     orange
4      e       kiwi
For each fruit, I want to check whether it contains each of the substrings in the column String. The result should look something like this:
               a      b      c      d      e
Fruit
apple       True  False  False  False   True
pineapple   True  False  False  False   True
banana      True   True  False  False  False
orange      True  False  False  False   True
kiwi       False  False  False  False  False
Is this possible with a built-in pandas method? My attempt uses a for-loop, shown below, but it may be slow as the number of strings/fruits grows.
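df_new = pd.DataFrame()
for i in range(len(df)):
    # one boolean column per substring
    df_new.loc[:, df.String[i]] = df.Fruit.str.contains(df.String[i])
# set_index returns a new DataFrame, so the result has to be assigned
df_new = df_new.set_index(df.Fruit)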
CodePudding user response:
Here is a non-loop solution: collect all matching substrings from column String with Series.str.findall, pass them to MultiLabelBinarizer, and then add the missing columns, filled with False, using DataFrame.reindex:
from sklearn.preprocessing import MultiLabelBinarizer

pat = '|'.join(df['String'])
s = df['Fruit'].str.findall(pat)

mlb = MultiLabelBinarizer()
df1 = (pd.DataFrame(mlb.fit_transform(s),
                    columns=mlb.classes_,
                    index=df['Fruit'])
         .astype(bool)
         .reindex(df['String'], fill_value=False, axis=1))
print(df1)
String         a      b      c      d      e
Fruit
apple       True  False  False  False   True
pineapple   True  False  False  False   True
banana      True   True  False  False  False
orange      True  False  False  False   True
kiwi       False  False  False  False  False
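One caveat about the pattern built with '|'.join: the substrings are interpreted as regular expressions. The single letters in the question are safe, but if a substring could contain a regex metacharacter, escaping each one with re.escape before joining keeps the match literal. A minimal sketch, using made-up data (the '.' substring and the 'v1.0' value are assumptions for illustration only):

```python
import re
import pandas as pd

# Hypothetical data: one of the substrings is '.', a regex metacharacter.
df = pd.DataFrame({'String': ['a', '.', 'e'],
                   'Fruit': ['apple', 'kiwi', 'v1.0']})

# Unescaped: '.' matches any character, so it appears to occur everywhere.
raw_pat = '|'.join(df['String'])
# Escaped: every substring is matched literally.
safe_pat = '|'.join(re.escape(s) for s in df['String'])

raw_hits = df['Fruit'].str.findall(raw_pat)
safe_hits = df['Fruit'].str.findall(safe_pat)

print(raw_hits.tolist())
print(safe_hits.tolist())
```

With the escaped pattern, 'kiwi' produces no matches, while the unescaped pattern matches every one of its characters.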
Alternatively, join the matches with '|' and use Series.str.get_dummies:
pat = '|'.join(df['String'])
df1 = (df['Fruit'].str.findall(pat)
                  .str.join('|')
                  .str.get_dummies()
                  .astype(bool)
                  .reindex(df['String'], fill_value=False, axis=1)
                  .set_index(df['Fruit']))
print(df1)
String         a      b      c      d      e
Fruit
apple       True  False  False  False   True
pineapple   True  False  False  False   True
banana      True   True  False  False  False
orange      True  False  False  False   True
kiwi       False  False  False  False  False
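To see why the reindex step is needed in the get_dummies chain, it may help to look at the intermediate values. The sketch below rebuilds the question's dataframe and splits the chain into named steps (the names hits, joined and dummies are only for illustration):

```python
import pandas as pd

df = pd.DataFrame(data={'String': list('abcde'),
                        'Fruit': ['apple', 'pineapple', 'banana', 'orange', 'kiwi']})

pat = '|'.join(df['String'])

# Step 1: list of matched substrings per fruit (empty list for 'kiwi').
hits = df['Fruit'].str.findall(pat)
# Step 2: join each list into a single '|'-separated string.
joined = hits.str.join('|')
# Step 3: one 0/1 indicator column per substring that occurs at least once.
dummies = joined.str.get_dummies()

print(joined.tolist())
print(list(dummies.columns))
```

Only substrings that actually occur somewhere become columns, so 'c' and 'd' are missing after step 3. Reindexing on df['String'] with fill_value=False adds them back.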
Another solution:
pat = r'({})'.format('|'.join(df['String']))
df1 = (pd.get_dummies(df['Fruit'].str.extractall(pat)[0], dtype=bool)
         .groupby(level=0).max()
         .reindex(index=df.index, columns=df['String'], fill_value=False)
         .set_index(df['Fruit'])
         .rename_axis(None, axis=1))
print(df1)
               a      b      c      d      e
Fruit
apple       True  False  False  False   True
pineapple   True  False  False  False   True
banana      True   True  False  False  False
orange      True  False  False  False   True
kiwi       False  False  False  False  False
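As a cross-check for the regex-based solutions, the same table can be built with Python's in operator for every (fruit, substring) pair. This is not vectorized, so it is a reference sketch rather than a faster alternative. It does avoid two regex subtleties: metacharacters need no escaping, and since findall and extractall return non-overlapping matches, a multi-character substring that overlaps an earlier match could be missed by the patterns above but not by in. The name check is only for illustration:

```python
import pandas as pd

df = pd.DataFrame(data={'String': list('abcde'),
                        'Fruit': ['apple', 'pineapple', 'banana', 'orange', 'kiwi']})

# Plain substring test for every (fruit, substring) pair; no regex involved.
check = pd.DataFrame(
    [[s in fruit for s in df['String']] for fruit in df['Fruit']],
    index=pd.Index(df['Fruit'], name='Fruit'),
    columns=list(df['String']),
)
print(check)
```

Comparing check with any of the results above, for example with DataFrame.equals after aligning the axis names, is a quick way to validate them on real data.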