Pandas, merge multiple dummy variables into one column by name

Time:01-22

I have a datafile with one VALUE column and multiple dummy variables representing TYPES. I have copied a short example below. I need the average of each type (which I can get) with a column with the named type (which I don't seem to be able to get). Pointers would be welcome.

import pandas as pd
data = {'salary' : [50000, 45000, 55000, 40000, 35000, 45000, 30000,25000,35000],
        'manager': [1,1,1,0,0,0,0,0,0], 'foreman': [0,0,0,1,1,1,0,0,0], 
        'worker': [0,0,0,0,0,0,1,1,1]}
df = pd.DataFrame(data=data)
df

This is my input data.

   salary  manager  foreman  worker
0   50000        1        0       0
1   45000        1        0       0
2   55000        1        0       0
3   40000        0        1       0
4   35000        0        1       0
5   45000        0        1       0
6   30000        0        0       1
7   25000        0        0       1
8   35000        0        0       1

I can get the average, like this, but not consolidate the three dummy vars into one categorical column:

print(df.groupby(['manager','foreman','worker']).mean().reset_index())

   manager  foreman  worker  salary
0        0        0       1   30000
1        0        1       0   40000
2        1        0       0   50000

I would like to have something that looks like this:

need = {'salary' : [50000, 45000, 55000, 40000, 35000, 45000, 30000,25000,35000],
        'type': ['manager','manager','manager','foreman','foreman','foreman','worker','worker','worker']}
df2 = pd.DataFrame(data=need)
df2

   salary     type
0   50000  manager
1   45000  manager
2   55000  manager
3   40000  foreman
4   35000  foreman
5   45000  foreman
6   30000   worker
7   25000   worker
8   35000   worker

I can do this simple example by hand. The result looks like this, which is ultimately where I will end up:

pay = {'type' : ['manager','foreman','worker'], 'avg_pay': [50000,40000,30000]}
df1 = pd.DataFrame(data=pay)
df1

      type  avg_pay
0  manager    50000
1  foreman    40000
2   worker    30000

Can't seem to find any documentation on how to "undummy" variables. How do I do this?

CodePudding user response:

Solutions if there is always exactly one 1 per row:

Use DataFrame.melt with ignore_index=False and var_name='type', then keep only the rows where value is 1 with DataFrame.loc. DataFrame.pop returns the value column for the filter and removes it from the result at the same time:

df = (df.melt('salary', ignore_index=False, var_name='type')
        .loc[lambda x: x.pop('value').eq(1)])
print (df)
   salary     type
0   50000  manager
1   45000  manager
2   55000  manager
3   40000  foreman
4   35000  foreman
5   45000  foreman
6   30000   worker
7   25000   worker
8   35000   worker

Or, starting again from the original df:

s = df.drop('salary', axis=1).stack()
df = df[['salary']].join(s[s.eq(1)].index.to_frame().droplevel(1)[1].rename('type'))
print (df)
   salary     type
0   50000  manager
1   45000  manager
2   55000  manager
3   40000  foreman
4   35000  foreman
5   45000  foreman
6   30000   worker
7   25000   worker
8   35000   worker
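
Both variants above rely on every row having exactly one 1. A minimal sketch of checking that first and then finishing with the per-type average the question is ultimately after; the names dummy_cols, long_df and avg are only illustrative, and the filter is written as an explicit mask rather than with pop:

```python
import pandas as pd

data = {'salary': [50000, 45000, 55000, 40000, 35000, 45000, 30000, 25000, 35000],
        'manager': [1, 1, 1, 0, 0, 0, 0, 0, 0],
        'foreman': [0, 0, 0, 1, 1, 1, 0, 0, 0],
        'worker': [0, 0, 0, 0, 0, 0, 1, 1, 1]}
df = pd.DataFrame(data)

dummy_cols = ['manager', 'foreman', 'worker']  # illustrative name

# The melt/stack solutions assume exactly one dummy is set per row.
if not df[dummy_cols].sum(axis=1).eq(1).all():
    raise ValueError('some rows have zero or several dummies set to 1')

# Long format: one row per (salary, type), keeping only the dummies equal to 1.
melted = df.melt('salary', ignore_index=False, var_name='type')
long_df = melted.loc[melted['value'].eq(1), ['salary', 'type']]

# Average salary per type; sort=False keeps the order of first appearance.
avg = (long_df.groupby('type', sort=False)['salary']
              .mean()
              .reset_index(name='avg_pay'))
print(avg)
```

If the check fails, the DataFrame.dot approach for rows with multiple 1s is probably the better fit.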
    

Solution if a row can contain multiple 1s: use DataFrame.dot with the column names, so that the matching names are joined into one string separated by commas:

df1 = df.set_index('salary')
df = df1.eq(1).dot(df1.columns + ',').str[:-1].reset_index(name='type')
print (df)
   salary     type
0   50000  manager
1   45000  manager
2   55000  manager
3   40000  foreman
4   35000  foreman
5   45000  foreman
6   30000   worker
7   25000   worker
8   35000   worker

Test data with multiple 1s per row:

data = {'salary' : [50000, 45000, 55000, 40000, 35000, 45000, 30000,25000,35000],
        'manager': [1,1,1,0,1,0,0,0,0], 'foreman': [0,0,0,1,1,1,0,0,0], 
        'worker': [1,0,0,1,0,0,1,1,1]}
df = pd.DataFrame(data=data)
print (df)
   salary  manager  foreman  worker
0   50000        1        0       1
1   45000        1        0       0
2   55000        1        0       0
3   40000        0        1       1
4   35000        1        1       0
5   45000        0        1       0
6   30000        0        0       1
7   25000        0        0       1
8   35000        0        0       1

df1 = df.set_index('salary')
df = df1.eq(1).dot(df1.columns + ',').str[:-1].reset_index(name='type')
print (df)
   salary             type
0   50000   manager,worker
1   45000          manager
2   55000          manager
3   40000   foreman,worker
4   35000  manager,foreman
5   45000          foreman
6   30000           worker
7   25000           worker
8   35000           worker
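
If the averages per type are still needed with comma-joined values like these, one option is to split the strings back apart with str.split and DataFrame.explode, so that a row with two types counts toward both. This is only a sketch: the frame is typed in by hand from the output above, and the names combined, exploded and avg are illustrative.

```python
import pandas as pd

# Copied from the printed result above: one comma-joined string per row.
combined = pd.DataFrame({
    'salary': [50000, 45000, 55000, 40000, 35000, 45000, 30000, 25000, 35000],
    'type': ['manager,worker', 'manager', 'manager', 'foreman,worker',
             'manager,foreman', 'foreman', 'worker', 'worker', 'worker'],
})

# Turn each string into a list, then give every list element its own row.
exploded = (combined.assign(type=combined['type'].str.split(','))
                    .explode('type'))

# A salary that belongs to two types now contributes to both averages.
avg = exploded.groupby('type')['salary'].mean()
print(avg)
```

Whether a multi-type row should count fully toward each type is a modelling choice; the sketch simply assumes it should.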

CodePudding user response:

We can also use DataFrame.pipe for this. With thanks to jezrael for the inspiration, the code can be made more concise:

(df.melt(id_vars='salary', var_name='type')
 .pipe(lambda d: d.loc[d['value'].eq(1)].drop('value', axis=1)))

    salary     type
0    50000  manager
1    45000  manager
2    55000  manager
12   40000  foreman
13   35000  foreman
14   45000  foreman
24   30000   worker
25   25000   worker
26   35000   worker
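
The index jumps to 12 and 24 here because melt, by default, renumbers the stacked rows from 0 to 26 before the filter is applied. A small sketch of two ways to get a tidier index, assuming the original df from the question (the names kept, renumbered and original are illustrative):

```python
import pandas as pd

data = {'salary': [50000, 45000, 55000, 40000, 35000, 45000, 30000, 25000, 35000],
        'manager': [1, 1, 1, 0, 0, 0, 0, 0, 0],
        'foreman': [0, 0, 0, 1, 1, 1, 0, 0, 0],
        'worker': [0, 0, 0, 0, 0, 0, 1, 1, 1]}
df = pd.DataFrame(data)

# Default melt: 3 dummy columns x 9 rows gives a fresh 0..26 index.
melted = df.melt(id_vars='salary', var_name='type')
kept = melted.loc[melted['value'].eq(1)].drop(columns='value')
print(kept.index.tolist())

# Option 1: renumber the surviving rows 0..8.
renumbered = kept.reset_index(drop=True)

# Option 2: keep the original row labels by passing ignore_index=False.
original = (df.melt(id_vars='salary', var_name='type', ignore_index=False)
              .pipe(lambda d: d.loc[d['value'].eq(1)].drop(columns='value')))
print(original.index.tolist())
```

Option 2 is handy if the result has to be joined back to the original frame by index.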

CodePudding user response:

If you always have at most one 1 per row, you can use the pandas.from_dummies function, introduced in pandas 1.5:

df[['salary']].join(pd.from_dummies(df.drop(columns='salary'))[''].rename('type'))

# or
pd.DataFrame({'salary': df['salary'],
              'type': pd.from_dummies(df.drop(columns='salary'))['']})

NB. If a row can contain only 0s, pass default_category=np.nan to from_dummies (this needs import numpy as np).

Output:

   salary     type
0   50000  manager
1   45000  manager
2   55000  manager
3   40000  foreman
4   35000  foreman
5   45000  foreman
6   30000   worker
7   25000   worker
8   35000   worker
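
For the all-zero case covered by default_category, here is a minimal sketch. It assumes pandas 1.5 or later; the four-row frame is made up for illustration (the last row has no 1 at all), and the label 'unknown' is an arbitrary placeholder rather than anything pandas prescribes.

```python
import pandas as pd

# Made-up example: the last row has no dummy set to 1.
df = pd.DataFrame({
    'salary': [50000, 40000, 30000, 20000],
    'manager': [1, 0, 0, 0],
    'foreman': [0, 1, 0, 0],
    'worker': [0, 0, 1, 0],
})

# Without sep, from_dummies returns a one-column frame whose column is named ''.
# Rows with only 0s receive the default_category value instead of raising.
types = pd.from_dummies(df.drop(columns='salary'),
                        default_category='unknown')['']

result = df[['salary']].assign(type=types)
print(result)
```

Without default_category, from_dummies raises an error for a row like the last one, which can be useful as a data check in its own right.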