I have a data file with one VALUE column and multiple dummy variables representing TYPES. I have copied a short example below. I need the average of each type (which I can get) alongside a column naming the type (which I can't seem to get). Pointers would be welcome.
import pandas as pd
data = {'salary' : [50000, 45000, 55000, 40000, 35000, 45000, 30000,25000,35000],
'manager': [1,1,1,0,0,0,0,0,0], 'foreman': [0,0,0,1,1,1,0,0,0],
'worker': [0,0,0,0,0,0,1,1,1]}
df = pd.DataFrame(data=data)
df
This is my input data.
salary manager foreman worker
0 50000 1 0 0
1 45000 1 0 0
2 55000 1 0 0
3 40000 0 1 0
4 35000 0 1 0
5 45000 0 1 0
6 30000 0 0 1
7 25000 0 0 1
8 35000 0 0 1
I can get the average, like this, but not consolidate the three dummy vars into one categorical column:
print(df.groupby(['manager','foreman','worker']).mean().reset_index())
manager foreman worker salary
0 0 0 1 30000
1 0 1 0 40000
2 1 0 0 50000
I would like to have something that looks like this:
need = {'salary' : [50000, 45000, 55000, 40000, 35000, 45000, 30000,25000,35000],
'type': ['manager','manager','manager','foreman','foreman','foreman','worker','worker','worker']}
df2 = pd.DataFrame(data=need)
df2
salary type
0 50000 manager
1 45000 manager
2 55000 manager
3 40000 foreman
4 35000 foreman
5 45000 foreman
6 30000 worker
7 25000 worker
8 35000 worker
I can do this simple example by hand. The result looks like this, which is ultimately where I will end up:
pay = {'type' : ['manager','foreman','worker'], 'avg_pay': [50000,40000,30000]}
df1 = pd.DataFrame(data=pay)
df1
type avg_pay
0 manager 50000
1 foreman 40000
2 worker 30000
Can't seem to find any documentation on how to "undummy" variables. How do I do this?
CodePudding user response:
Solutions if there is always exactly one 1 per row:
Use DataFrame.melt with the ignore_index=False and var_name='type' parameters, then filter the rows equal to 1 in DataFrame.loc, using DataFrame.pop to remove the value column:
df = (df.melt('salary', ignore_index=False, var_name='type')
.loc[lambda x: x.pop('value').eq(1)])
print (df)
salary type
0 50000 manager
1 45000 manager
2 55000 manager
3 40000 foreman
4 35000 foreman
5 45000 foreman
6 30000 worker
7 25000 worker
8 35000 worker
Or:
s = df.drop('salary', axis=1).stack()
df = df[['salary']].join(s[s.eq(1)].index.to_frame().droplevel(1)[1].rename('type'))
print (df)
salary type
0 50000 manager
1 45000 manager
2 55000 manager
3 40000 foreman
4 35000 foreman
5 45000 foreman
6 30000 worker
7 25000 worker
8 35000 worker
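For the table of averages the question is ultimately after, the long-format result can be grouped by the new type column. A minimal sketch, assuming the original sample data; the names long_df and avg are only illustrative, and sort=False is used here just to keep the manager/foreman/worker order:

```python
import pandas as pd

data = {'salary': [50000, 45000, 55000, 40000, 35000, 45000, 30000, 25000, 35000],
        'manager': [1, 1, 1, 0, 0, 0, 0, 0, 0],
        'foreman': [0, 0, 0, 1, 1, 1, 0, 0, 0],
        'worker': [0, 0, 0, 0, 0, 0, 1, 1, 1]}
df = pd.DataFrame(data)

# Reshape to long format and keep only the rows where the dummy equals 1
long_df = df.melt('salary', ignore_index=False, var_name='type')
long_df = long_df.loc[long_df['value'].eq(1)].drop(columns='value')

# Average salary per type; sort=False keeps the order of first appearance
avg = (long_df.groupby('type', sort=False)['salary']
              .mean()
              .rename('avg_pay')
              .reset_index())
print(avg)
```

Note that mean() returns floats, so avg_pay holds 50000.0, 40000.0 and 30000.0 rather than integers.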
Solution if multiple 1s per row are possible: use DataFrame.dot with the column names, joining the matching names with a comma:
df1 = df.set_index('salary')
df = df1.eq(1).dot(df1.columns + ',').str[:-1].reset_index(name='type')
print (df)
salary type
0 50000 manager
1 45000 manager
2 55000 manager
3 40000 foreman
4 35000 foreman
5 45000 foreman
6 30000 worker
7 25000 worker
8 35000 worker
Test data with multiple 1s per row:
data = {'salary' : [50000, 45000, 55000, 40000, 35000, 45000, 30000,25000,35000],
'manager': [1,1,1,0,1,0,0,0,0], 'foreman': [0,0,0,1,1,1,0,0,0],
'worker': [1,0,0,1,0,0,1,1,1]}
df = pd.DataFrame(data=data)
print (df)
salary manager foreman worker
0 50000 1 0 1
1 45000 1 0 0
2 55000 1 0 0
3 40000 0 1 1
4 35000 1 1 0
5 45000 0 1 0
6 30000 0 0 1
7 25000 0 0 1
8 35000 0 0 1
df1 = df.set_index('salary')
df = df1.eq(1).dot(df1.columns + ',').str[:-1].reset_index(name='type')
print (df)
salary type
0 50000 manager,worker
1 45000 manager
2 55000 manager
3 40000 foreman,worker
4 35000 manager,foreman
5 45000 foreman
6 30000 worker
7 25000 worker
8 35000 worker
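If per-type averages are still wanted when a row can belong to several types, one possible follow-up is to split the comma-joined labels and explode them, so that a salary counts once for each type it is flagged with. This is only a sketch under that assumption (it is not part of the answer above); the names labels, exploded and avg are illustrative:

```python
import pandas as pd

data = {'salary': [50000, 45000, 55000, 40000, 35000, 45000, 30000, 25000, 35000],
        'manager': [1, 1, 1, 0, 1, 0, 0, 0, 0],
        'foreman': [0, 0, 0, 1, 1, 1, 0, 0, 0],
        'worker': [1, 0, 0, 1, 0, 0, 1, 1, 1]}
df = pd.DataFrame(data)

# Build the comma-joined type label per row, as in the answer above
df1 = df.set_index('salary')
labels = df1.eq(1).dot(df1.columns + ',').str[:-1].reset_index(name='type')

# Split 'manager,worker' into a list and give each type its own row
exploded = labels.assign(type=labels['type'].str.split(',')).explode('type')

# Each salary now contributes to every type it was flagged with
avg = exploded.groupby('type')['salary'].mean()
print(avg)
```

Whether a shared row should count fully towards each type is a modelling choice; this sketch simply counts it in both.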
CodePudding user response:
We can use pd.DataFrame.pipe for this purpose. Special thanks to jezrael for the inspiration, which makes the code more concise:
(df.melt(id_vars='salary', var_name='type')
.pipe(lambda d: d.loc[d['value'].eq(1)].drop('value', axis=1)))
salary type
0 50000 manager
1 45000 manager
2 55000 manager
12 40000 foreman
13 35000 foreman
14 45000 foreman
24 30000 worker
25 25000 worker
26 35000 worker
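The index in this output (0, 1, 2, 12, 13, ...) comes from melt renumbering the stacked rows when ignore_index is left at its default of True. If a plain 0..8 index is preferred, a reset_index(drop=True) at the end of the chain is one way to get it. A small sketch, assuming the original sample data; the name out is illustrative:

```python
import pandas as pd

data = {'salary': [50000, 45000, 55000, 40000, 35000, 45000, 30000, 25000, 35000],
        'manager': [1, 1, 1, 0, 0, 0, 0, 0, 0],
        'foreman': [0, 0, 0, 1, 1, 1, 0, 0, 0],
        'worker': [0, 0, 0, 0, 0, 0, 1, 1, 1]}
df = pd.DataFrame(data)

out = (df.melt(id_vars='salary', var_name='type')
         # keep the rows whose dummy is 1, then drop the helper column
         .pipe(lambda d: d.loc[d['value'].eq(1)].drop('value', axis=1))
         # replace the melted row numbers with a fresh 0..n-1 index
         .reset_index(drop=True))
print(out)
```

Passing ignore_index=False to melt instead, as in the first answer, keeps the original row labels rather than creating new ones.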
CodePudding user response:
If you always have at most one 1 per row, pandas 1.5 introduced the pandas.from_dummies function:
df[['salary']].join(pd.from_dummies(df.drop(columns='salary'))[''].rename('type'))
# or
pd.DataFrame({'salary': df['salary'],
'type': pd.from_dummies(df.drop(columns='salary'))['']})
NB. If a row can contain only 0s, add the default_category=np.nan parameter to from_dummies.
Output:
salary type
0 50000 manager
1 45000 manager
2 55000 manager
3 40000 foreman
4 35000 foreman
5 45000 foreman
6 30000 worker
7 25000 worker
8 35000 worker
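To illustrate the default_category note, here is a small sketch with a made-up four-row frame whose last row has no type set. The data and the 'unknown' label are assumptions for the example only, and it needs pandas 1.5 or later:

```python
import pandas as pd

# Illustrative data: the last row has 0 in every dummy column
df = pd.DataFrame({'salary': [50000, 40000, 30000, 20000],
                   'manager': [1, 0, 0, 0],
                   'foreman': [0, 1, 0, 0],
                   'worker': [0, 0, 1, 0]})

# With no sep given, from_dummies returns a single column named ''
types = pd.from_dummies(df.drop(columns='salary'),
                        default_category='unknown')['']

out = df[['salary']].join(types.rename('type'))
print(out)
```

Without default_category, from_dummies raises a ValueError for a row that has no 1 in it.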