I have this DataFrame:
classification       text     apple  banana  peach  grape
["apple","grape"]    anytext  NaN    NaN     NaN    NaN
How can I check, for each of these columns, whether its name appears in the list in the classification
column, to get this:
classification       text     apple  banana  peach  grape
["apple","grape"]    anytext  1      0       0      1
Data:
{'classification': [['apple', 'grape']],
'text': ['anytext'],
'apple': [nan],
'banana': [nan],
'peach': [nan],
'grape': [nan]}
CodePudding user response:
You could apply a lambda function on "classification" that checks, for each column name, whether it appears among the items in that row's list:
cols = ['apple','banana','peach','grape']
df[cols] = df['classification'].apply(lambda x: [1 if col in x else 0 for col in cols]).tolist()
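For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes df is built from the "Data" dict in the question, with nan taken from NumPy; the name flags is only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the one-row frame from the question's "Data" dict (assumed construction).
df = pd.DataFrame({
    'classification': [['apple', 'grape']],
    'text': ['anytext'],
    'apple': [np.nan],
    'banana': [np.nan],
    'peach': [np.nan],
    'grape': [np.nan],
})

cols = ['apple', 'banana', 'peach', 'grape']

# For each row, build one 0/1 flag per target column, in the order of cols.
flags = df['classification'].apply(lambda x: [int(col in x) for col in cols])

# Convert the Series of lists to a list of lists and write it into the columns.
df[cols] = flags.tolist()
print(df)
```

Because the flags are built in the order of cols, the assignment lines up with the target columns positionally.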
Another option is to use explode, stack and fillna to get a blank Series whose MultiIndex consists of the index, "classification" and the column names of df. Then evaluate whether each item in "classification" matches a column name, create a Series from the result, and use unstack, groupby and sum to build a DataFrame to assign back to df:
tmp = df.explode('classification')
s = tmp.set_index([tmp.index, tmp['classification']])[cols].fillna(0).stack()
s = pd.Series((s.index.get_level_values(1)==s.index.get_level_values(2)).astype(int), index=s.index)
df[cols] = s.unstack().groupby(level=0).sum()
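The same steps can be spelled out with the intermediates named, which may make the MultiIndex logic easier to follow. This is only a sketch: it assumes df is built from the question's "Data" dict, and the names item, col, match and wide are illustrative.

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the question's "Data" dict (assumed construction).
df = pd.DataFrame({
    'classification': [['apple', 'grape']],
    'text': ['anytext'],
    'apple': [np.nan], 'banana': [np.nan],
    'peach': [np.nan], 'grape': [np.nan],
})
cols = ['apple', 'banana', 'peach', 'grape']

# One row per list item; the original row label is repeated.
tmp = df.explode('classification')

# Index by (row label, item), then stack the target columns so each
# entry is keyed by (row label, item, column name).
s = tmp.set_index([tmp.index, tmp['classification']])[cols].fillna(0).stack()

# 1 where the exploded item equals the column name, else 0.
item = s.index.get_level_values(1)
col = s.index.get_level_values(2)
match = pd.Series((item == col).astype(int), index=s.index)

# Back to wide form, then collapse the per-item rows into one row per label.
wide = match.unstack().groupby(level=0).sum()
print(wide)
```

Printing wide before assigning it to df[cols] is a quick way to confirm that its columns are in the order you expect.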
An even simpler option is to use explode, pd.get_dummies, groupby and sum to get the items in "classification" as dummy variables, then update df with the result using fillna:
df[cols] = df[cols].fillna(pd.get_dummies(df['classification'].explode()).groupby(level=0).sum()).fillna(0)
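As a variation on this idea (not the exact line above), you could reindex the dummies to cols instead of filling NaN twice. The sketch below assumes df is built from the question's "Data" dict; dummies is an illustrative name. Reindexing adds all-zero columns for names that never occur in any list, and the cast keeps the result as integers.

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the question's "Data" dict (assumed construction).
df = pd.DataFrame({
    'classification': [['apple', 'grape']],
    'text': ['anytext'],
    'apple': [np.nan], 'banana': [np.nan],
    'peach': [np.nan], 'grape': [np.nan],
})
cols = ['apple', 'banana', 'peach', 'grape']

# One indicator column per distinct item, summed back to one row per label.
dummies = pd.get_dummies(df['classification'].explode()).groupby(level=0).sum()

# Align to the target columns: names that never occur become all-zero columns.
dummies = dummies.reindex(columns=cols, fill_value=0).astype(int)

df[cols] = dummies
print(df)
```

Items that appear in "classification" but are not listed in cols are dropped by the reindex, so only the columns you asked for are written.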
Output:
   classification     text  apple  banana  peach  grape
0  [apple, grape]  anytext      1       0      0      1