I have two dfs(500x100 & 1300x2) and want to create a new column in the first one with which categories that occur on each row. To achieve this I need to fetch the category associated with the column name from second df. There might be several categories on same row.
df = pd.DataFrame({'apple': [0, 0, 1, 0],
'strawberries': [0, 1, 1, 0],
'cucumber': [1, 1, 0, 0],
'hawthorn': [0, 1, 0, 1]
})
df2 = pd.DataFrame({'storage': ['apple', 'strawberries', 'cucumber', 'hawthorn'],
'category': ['fruits', 'berries', 'vegetables', 'berries']
})
I've found two potential solutions which both aims to fetch value from dict when value of row is != 0:
df2_dict = dict(zip(df2['storage'], df2['category']))
df['categories'] = pd.Series(df.columns[np.where(df!=0)[1]]).map(df2_dict)
|
df['categories'] = df.apply(lambda s: ', '.join(s.index[s.eq(1)]), axis = 1).map(df2_dict)
These works to some extent but for some reason only give me results on about 1/10 of the rows. Desired output would be:
df = pd.DataFrame({'apple': [0, 0, 1, 0],
'strawberries': [0, 1, 1, 0],
'cucumber': [1, 1, 0, 0],
'hawthorn': [0, 1, 0, 1],
'categories': ['vegetables', 'berries, vegetables, berries',
'fruits, berries', 'berries' ]})
As of now column names are keys in dict. FYI the columns are dummies so only 0|1 in them.
Appreciate any smart solutions to this. xoxo
CodePudding user response:
there might be easier ways of doing this but this works i think :)
df = pd.DataFrame({'apple': [0, 0, 1, 0],
'strawberries': [0, 1, 1, 0],
'cucumber': [1, 1, 0, 0],
'hawthorn': [0, 1, 0, 1]})
df2 = pd.DataFrame({'storage': ['apple', 'strawberries', 'cucumber', 'hawthorn'],
'category': ['fruits', 'berries', 'vegetables', 'berries']})
def cateogory (row):
result = []
for column in list(df.columns) :
if row[column] == 1 :
result.append (df2.loc[df2['storage'] == column]["category"])
return [item for sublist in result for item in sublist]
df['category'] = df.apply(lambda row : cateogory(row) , axis=1 )
Result :
apple strawberries cucumber hawthorn category
0 0 0 1 0 [vegetables]
1 0 1 1 1 [berries, vegetables, berries]
2 1 1 0 0 [fruits, berries]
3 0 0 0 1 [berries]
btw edited your example, there were some mistakes in it
Edit : corrected i think !
CodePudding user response:
df.apply(lambda x:x.idxmax(), axis=1).map(dict(df2.values))
output:
0 vegetables
1 berries
2 fruits
3 greens
dtype: object
make result to category
column
If there are df
values greater than 1, change df
's value to 0 or 1
df.gt(0).apply(lambda x:x.idxmax(), axis=1).map(dict(df2.values))
same result
If this is not result you want, draw desired output of the example.
after edit question
df.apply(lambda x: ','.join(x[x>0].index.map(dict(df2.values))), axis=1)
result:
0 vegetables
1 berries,vegetables,greens
2 fruits,berries
3 greens
dtype: object
make result to category
column