I am working on an employee attrition problem. I have columns for department and if the employee quit or not.
Since some departments have more employees working in them, I would assume that they account for the majority of attrition.
So I would like to graph the attrition rate per department. For example, out of the 1000 people that work in IT 200 quit (20% attrition rate).
What I have been doing is creating pivot tables and cross table to get the relevant information then calculating manually attrition rates. But I feel like this can be done more efficiently.
What I have currently is:
- Sample DataFrame
df= pd.DataFrame({'department': ['sales','sales','HR','sales','sales','HR','sales','R&D','HR','sales'],
'quit': ['yes','no','no','yes','no','yes','yes','no','yes','no'],
'employee_count': [1,1,1,1,1,1,1,1,1,1]})
- Pivot Table to find out how many employees are in each department
pivot = pd.pivot_table(df, values='employee_count',
columns=['department'], aggfunc=np.sum)
which gives output of
. | HR | R&D | Sales |
---|---|---|---|
employee_count | 3 | 1 | 6 |
- Crosstab to find the number of employees that quit per department
pivot2=pd.crosstab(values=df['employee_count'],
index=df['department'],
columns=df['quit'],
aggfunc=np.sum)
Quit | no | yes |
---|---|---|
HR | 1 | 2 |
R&D | 1 | nan |
Sales | 3 | 3 |
- Calculate Manually and Plot
import matplotlib.pyplot as plt
plt.rcParams['axes.facecolor'] = 'white'
fig = plt.figure()
fig.patch.set_facecolor('white')
names = ['HR','sales']
values = [66.66666667,50]
plt.figure()
plt.bar(names, values)
any help would be appreciated, thanks for your time
CodePudding user response:
Pandas offers to group by certain columns. In your case we can aggregate the results based on the department and whether someone quit. Then use a pivot to change the format of the table and finally calculate the total number of employees and quit rate:
# obtain the relevant results by grouping
agg_df = df.groupby(by=['department', 'quit']).agg({'employee_count': 'sum'})
agg_df.reset_index(drop=False, inplace=True)
# pivot the table for further usage
pivot_df = agg_df.pivot(columns=['quit'], index=['department'], values='employee_count')
pivot_df.fillna(0, inplace=True)
# calculate final statistics
pivot_df['total'] = pivot_df['yes'] pivot_df['no']
pivot_df['quit_rate'] = pivot_df['yes'] / pivot_df['total']