Extract value associated with column name on non-zero rows-CodePudding

I have two dfs(500x100 & 1300x2) and want to create a new column in the first one with which categories that occur on each row. To achieve this I need to fetch the category associated with the column name from second df. There might be several categories on same row.

df = pd.DataFrame({'apple': [0, 0, 1, 0], 
'strawberries': [0, 1, 1, 0], 
'cucumber': [1, 1, 0, 0], 
'hawthorn': [0, 1, 0, 1]
})

df2 = pd.DataFrame({'storage': ['apple', 'strawberries', 'cucumber', 'hawthorn'],
'category': ['fruits', 'berries', 'vegetables', 'berries']
})

I've found two potential solutions which both aims to fetch value from dict when value of row is != 0:

df2_dict = dict(zip(df2['storage'], df2['category']))
df['categories'] = pd.Series(df.columns[np.where(df!=0)[1]]).map(df2_dict)
|
df['categories'] = df.apply(lambda s: ', '.join(s.index[s.eq(1)]), axis = 1).map(df2_dict)

These works to some extent but for some reason only give me results on about 1/10 of the rows. Desired output would be:

df = pd.DataFrame({'apple': [0, 0, 1, 0], 
'strawberries': [0, 1, 1, 0], 
'cucumber': [1, 1, 0, 0], 
'hawthorn': [0, 1, 0, 1],
'categories': ['vegetables', 'berries, vegetables, berries',
'fruits, berries', 'berries' ]})

As of now column names are keys in dict. FYI the columns are dummies so only 0|1 in them.

Appreciate any smart solutions to this. xoxo

CodePudding user response：

there might be easier ways of doing this but this works i think :)

df = pd.DataFrame({'apple': [0, 0, 1, 0], 
'strawberries': [0, 1, 1, 0], 
'cucumber': [1, 1, 0, 0], 
'hawthorn': [0, 1, 0, 1]})

df2 = pd.DataFrame({'storage': ['apple', 'strawberries', 'cucumber', 'hawthorn'],
'category': ['fruits', 'berries', 'vegetables', 'berries']})

def cateogory (row):
    result = []
    for column in list(df.columns) :
        if row[column] == 1 :
            result.append (df2.loc[df2['storage'] == column]["category"])
    return [item for sublist in result for item in sublist]

df['category']  = df.apply(lambda row :  cateogory(row) , axis=1 )

Result :

   apple  strawberries  cucumber  hawthorn                        category
0      0             0         1         0                    [vegetables]
1      0             1         1         1  [berries, vegetables, berries]
2      1             1         0         0               [fruits, berries]
3      0             0         0         1                       [berries]

btw edited your example, there were some mistakes in it

Edit : corrected i think !

CodePudding user response：

df.apply(lambda x:x.idxmax(), axis=1).map(dict(df2.values))

output:

0    vegetables
1       berries
2        fruits
3        greens
dtype: object

make result to category column

If there are df values greater than 1, change df's value to 0 or 1

df.gt(0).apply(lambda x:x.idxmax(), axis=1).map(dict(df2.values))

same result

If this is not result you want, draw desired output of the example.

after edit question

df.apply(lambda x: ','.join(x[x>0].index.map(dict(df2.values))), axis=1)

result:

0                   vegetables
1    berries,vegetables,greens
2               fruits,berries
3                       greens
dtype: object

make result to category column