Say in my dataset I have 3 nominal/categorical variables-Year(2 possible unique values), Gender(2 possible unique values), Country(2 possibleunique values) and 2 numerical variables- Work exp in years and Salary. Thus we can make 8 (2x2X2) possible combinations of categorical variables. However, my data does not have all the combinations but lets say 5 out of 8 (see the data example below).
Example: Data (Obtained after pandas group by)- 5 group combinations
df.groupby(['Years','Gender','Country'])[['Salary','Work ex']].mean()
df.reset_index(inpalce=True)
Years | Gender | Country | Salary | Work ex |
---|---|---|---|---|
2010 | Male | USA | 50 | 2 |
2011 | Female | India | 30 | 1 |
2011 | Male | India | 10 | 3 |
2011 | Male | USA | 50 | 2 |
2011 | Female | USA | 80 | 2 |
Now what I want is to have rows with all the combinations of categorical variables i.e. 8 rows, thus, for the new 3 rows the numercical variables will have null values and for rest 5 it would have values as shown below
Years | Gender | Country | Mean Salary | Mean Work ex |
---|---|---|---|---|
2010 | Male | USA | 50 | 2 |
2010 | Male | India | NA | NA |
2010 | Female | USA | NA | NA |
2010 | Female | India | NA | NA |
2011 | Female | India | 30 | 1 |
2011 | Male | India | 10 | 3 |
2011 | Male | USA | 50 | 2 |
2011 | Female | USA | 80 | 2 |
PS: My original data had years, gender, country, salary, work exp as variables. I have grouped (on years,gender,country) and summarised (on work ex and salary).That led to data above with only 5 different group combinatins out of 8. Now, I want to add the rest of the possible groups (3 groups) with null values.
CodePudding user response:
Assuming you achieved step 1 and lets call it df_grp
.
Then create a dataframe with all possible combination of ['Years', 'Gender', 'Country'] like:
df_all = pd.MultiIndex.from_product([df_grp['Years'].unique(), df_grp['Gender'
].unique(), df_grp['Country'
].unique()]).to_frame().reset_index(drop=True)
df_all.columns = ['Years', 'Gender', 'Country']
Then do an outer merge with df_grp
out = df_all.merge(df_grp, on=['Years', 'Gender', 'Country'], how = 'outer')
print(out):
Years Gender Country Mean Salary Mean Work ex.
0 2010 Male India NaN NaN
1 2010 Male USA 50.0 1.5
2 2010 Female India NaN NaN
3 2010 Female USA NaN NaN
4 2011 Male India 10.0 3.0
5 2011 Male USA 50.0 2.0
6 2011 Female India 30.0 1.0
7 2011 Female USA 80.0 2.0
CodePudding user response:
Make sure the variables are categories, then use pd.groupby()
:
df = pd.DataFrame({'Years': {0: 2010, 1: 2011, 2: 2011, 3: 2011, 4: 2011, 5: 2010},
'Gender': {0: 'Male', 1: 'Female', 2: 'Male', 3: 'Male', 4: 'Female', 5: 'Male'},
'Country': {0: 'USA', 1: 'India', 2: 'India', 3: 'USA', 4: 'USA', 5: 'USA'},
'Salary': {0: 50, 1: 30, 2: 10, 3: 50, 4: 80, 5: 50},
'Work ex': {0: 2, 1: 1, 2: 3, 3: 2, 4: 2, 5: 1}})
df[['Years', 'Gender', 'Country']] = df[['Years', 'Gender', 'Country']].astype('category')
df.groupby(['Years', 'Gender', 'Country'])[['Salary', 'Work ex']].mean().reset_index()
Output:
Years Gender Country Salary Work ex
0 2010 Female India NaN NaN
1 2010 Female USA NaN NaN
2 2010 Male India NaN NaN
3 2010 Male USA 50.0 1.5
4 2011 Female India 30.0 1.0
5 2011 Female USA 80.0 2.0
6 2011 Male India 10.0 3.0
7 2011 Male USA 50.0 2.0
You can also set the missing values to zero by doing:
df.groupby(['Years', 'Gender', 'Country'])[['Salary', 'Work ex']].mean().fillna(0).reset_index()
Output:
Years Gender Country Salary Work ex
0 2010 Female India 0.0 0.0
1 2010 Female USA 0.0 0.0
2 2010 Male India 0.0 0.0
3 2010 Male USA 50.0 1.5
4 2011 Female India 30.0 1.0
5 2011 Female USA 80.0 2.0
6 2011 Male India 10.0 3.0
7 2011 Male USA 50.0 2.0