Let's suppose I have a dataset like this:
ID | Department | Level |
---|---|---|
1 | HR | senior |
2 | IT | junior |
3 | IT | middle |
4 | IT | middle |
5 | IT | senior |
6 | HR | middle |
7 | Design | middle |
8 | Design | middle |
How do I transform this dataset using value_counts() on every column, so that the end result looks like this:
Department | Department_Count | Level | Level_Count |
---|---|---|---|
HR | 2 | middle | 1 |
 | | senior | 1 |
IT | 4 | junior | 1 |
 | | middle | 2 |
 | | senior | 1 |
Design | 2 | middle | 2 |
CodePudding user response:
You can do this by creating two groupby DataFrames, one for each of the counts, and merging them together.
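For reference, here is a minimal sketch that rebuilds the sample data from the question as a DataFrame named df, so the snippets below can be run as-is:
import pandas as pd

# Sample dataset from the question
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Department': ['HR', 'IT', 'IT', 'IT', 'IT', 'HR', 'Design', 'Design'],
    'Level': ['senior', 'junior', 'middle', 'middle', 'senior', 'middle', 'middle', 'middle'],
})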
Department counts:
dept = df.groupby('Department', as_index=False).count()[['Department', 'ID']]
dept = dept.rename(columns={'ID': 'Department_Count'})
  Department  Department_Count
0     Design                 2
1         HR                 2
2         IT                 4
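Since the question asks specifically about value_counts(), here is an equivalent sketch for the same per-department counts; note that value_counts() orders by count rather than by Department, hence the extra sort:
dept = (df['Department']
        .value_counts()                        # Series: Department -> number of rows
        .rename_axis('Department')             # make sure the index is named
        .reset_index(name='Department_Count')  # move the index back into a column
        .sort_values('Department', ignore_index=True))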
Level counts:
level = df.groupby(['Department', 'Level'], as_index=False).count()
level = level.rename(columns={'ID': 'Level_Count'})
  Department   Level  Level_Count
0     Design  middle            2
1         HR  middle            1
2         HR  senior            1
3         IT  junior            1
4         IT  middle            2
5         IT  senior            1
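The same counts can also be produced with DataFrame.value_counts() (available since pandas 1.1); this is just an alternative sketch, the groupby above gives the same result:
level = (df.value_counts(subset=['Department', 'Level'])   # MultiIndex Series of pair counts
           .reset_index(name='Level_Count')                # back to ordinary columns
           .sort_values(['Department', 'Level'], ignore_index=True))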
Then merge the two together on Department:
df_out = dept.merge(level, on='Department')
  Department  Department_Count   Level  Level_Count
0     Design                 2  middle            2
1         HR                 2  middle            1
2         HR                 2  senior            1
3         IT                 4  junior            1
4         IT                 4  middle            2
5         IT                 4  senior            1
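A small aside on this step: merge() performs an inner join by default, which is safe here because every Department present in level also appears in dept. Spelling it out changes nothing for this data:
# Equivalent: the default join type is already 'inner'
df_out = dept.merge(level, on='Department', how='inner')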
To get the NaNs in the Department and Department_Count columns as requested, you could use .loc to find the duplicated rows in those columns and replace them with np.nan (this requires numpy):
import numpy as np

# Blank out the repeated Department / Department_Count values, keeping only the first row of each group
df_out.loc[df_out[['Department', 'Department_Count']].duplicated(),
           ['Department', 'Department_Count']] = np.nan
  Department  Department_Count   Level  Level_Count
0     Design               2.0  middle            2
1         HR               2.0  middle            1
2        NaN               NaN  senior            1
3         IT               4.0  junior            1
4        NaN               NaN  middle            2
5        NaN               NaN  senior            1
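If you would rather see empty cells than NaN, as in the table in the question, one display-oriented option is to fill the NaN values with empty strings (stored here in a new variable just for printing); note that Department_Count still shows as 2.0 / 4.0 because inserting NaN converted that column to float:
# Purely for display: show blanks instead of NaN
df_out_display = df_out.astype(object).fillna('')
print(df_out_display)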