Home > Software design >  how do you groupby multiple columns in Pandas and add rows for missing groups
how do you groupby multiple columns in Pandas and add rows for missing groups

Time:09-27

Say in my dataset I have 3 nominal/categorical variables-Year(2 possible unique values), Gender(2 possible unique values), Country(2 possibleunique values) and 2 numerical variables- Work exp in years and Salary. Thus we can make 8 (2x2X2) possible combinations of categorical variables. However, my data does not have all the combinations but lets say 5 out of 8 (see the data example below).

Example: Data (Obtained after pandas group by)- 5 group combinations

df.groupby(['Years','Gender','Country'])[['Salary','Work ex']].mean()
df.reset_index(inpalce=True) 
Years Gender Country Salary Work ex
2010 Male USA 50 2
2011 Female India 30 1
2011 Male India 10 3
2011 Male USA 50 2
2011 Female USA 80 2

Now what I want is to have rows with all the combinations of categorical variables i.e. 8 rows, thus, for the new 3 rows the numercical variables will have null values and for rest 5 it would have values as shown below

Years Gender Country Mean Salary Mean Work ex
2010 Male USA 50 2
2010 Male India NA NA
2010 Female USA NA NA
2010 Female India NA NA
2011 Female India 30 1
2011 Male India 10 3
2011 Male USA 50 2
2011 Female USA 80 2

PS: My original data had years, gender, country, salary, work exp as variables. I have grouped (on years,gender,country) and summarised (on work ex and salary).That led to data above with only 5 different group combinatins out of 8. Now, I want to add the rest of the possible groups (3 groups) with null values.

CodePudding user response:

Assuming you achieved step 1 and lets call it df_grp.

Then create a dataframe with all possible combination of ['Years', 'Gender', 'Country'] like:

df_all = pd.MultiIndex.from_product([df_grp['Years'].unique(), df_grp['Gender'
                                    ].unique(), df_grp['Country'
                                    ].unique()]).to_frame().reset_index(drop=True)
df_all.columns = ['Years', 'Gender', 'Country']

Then do an outer merge with df_grp

out = df_all.merge(df_grp, on=['Years', 'Gender', 'Country'], how = 'outer')

print(out):

   Years  Gender Country  Mean Salary  Mean Work ex.
0   2010    Male   India          NaN            NaN
1   2010    Male     USA         50.0            1.5
2   2010  Female   India          NaN            NaN
3   2010  Female     USA          NaN            NaN
4   2011    Male   India         10.0            3.0
5   2011    Male     USA         50.0            2.0
6   2011  Female   India         30.0            1.0
7   2011  Female     USA         80.0            2.0

CodePudding user response:

Make sure the variables are categories, then use pd.groupby():

df = pd.DataFrame({'Years': {0: 2010, 1: 2011, 2: 2011, 3: 2011, 4: 2011, 5: 2010},
                   'Gender': {0: 'Male', 1: 'Female', 2: 'Male', 3: 'Male', 4: 'Female', 5: 'Male'},
                   'Country': {0: 'USA', 1: 'India', 2: 'India', 3: 'USA', 4: 'USA', 5: 'USA'},
                   'Salary': {0: 50, 1: 30, 2: 10, 3: 50, 4: 80, 5: 50},
                   'Work ex': {0: 2, 1: 1, 2: 3, 3: 2, 4: 2, 5: 1}})

df[['Years', 'Gender', 'Country']] = df[['Years', 'Gender', 'Country']].astype('category')

df.groupby(['Years', 'Gender', 'Country'])[['Salary', 'Work ex']].mean().reset_index()

Output:

  Years  Gender Country  Salary  Work ex
0  2010  Female   India     NaN      NaN
1  2010  Female     USA     NaN      NaN
2  2010    Male   India     NaN      NaN
3  2010    Male     USA    50.0      1.5
4  2011  Female   India    30.0      1.0
5  2011  Female     USA    80.0      2.0
6  2011    Male   India    10.0      3.0
7  2011    Male     USA    50.0      2.0

You can also set the missing values to zero by doing:

df.groupby(['Years', 'Gender', 'Country'])[['Salary', 'Work ex']].mean().fillna(0).reset_index()

Output:

  Years  Gender Country  Salary  Work ex
0  2010  Female   India     0.0      0.0
1  2010  Female     USA     0.0      0.0
2  2010    Male   India     0.0      0.0
3  2010    Male     USA    50.0      1.5
4  2011  Female   India    30.0      1.0
5  2011  Female     USA    80.0      2.0
6  2011    Male   India    10.0      3.0
7  2011    Male     USA    50.0      2.0
  • Related