Home > front end >  Impute Average value of one column grouped by other columns in Pandas Dataframe
Impute Average value of one column grouped by other columns in Pandas Dataframe

Time:07-15

I have a dataframe df :

City Territory Region Area Sales
Chicopee Springfield MA Northeast National 58761
Chicopee Springfield MA Northeast National 65204
Chicopee Springfield MA Northeast National 79862
Feeding Hills Springfield MA Northeast National 67247
Holyoke Springfield MA Northeast National 64347
Holyoke Springfield MA Northeast National 73473
Northampton Springfield MA Northeast National 43349
South Hadley Springfield MA Northeast National 43551
South Hadley Springfield MA Northeast National 73633
Springfield Springfield MA Northeast National 64025
Springfield Springfield MA Northeast National 56670
Springfield Springfield MA Northeast National 79424
Springfield Springfield MA Northeast National 39118
Springfield Springfield MA Northeast National 74262
Leominster Worcester MA Northeast National 39348
Lunenburg Worcester MA Northeast National 33119
Westborough Worcester MA Northeast National 54086
Worcester Worcester MA Northeast National 79859
Worcester Worcester MA Northeast National 40721
Framingham Worcester MA Northeast National 79197
Framingham Worcester MA Northeast National 36837
Framingham Worcester MA Northeast National 66552
Framingham Worcester MA Northeast National 66355
Concord Worcester MA Northeast National 78933
Holliston Worcester MA Northeast National 37515
Hudson Worcester MA Northeast National 49778

I want to impute the columns with average of Sales for the respective columns at the respective levels:

City Territory_Name Region Area Sales
72628 52548.57143 58083.19231 53497.5 67241
72628 52548.57143 58083.19231 53497.5 68107
72628 52548.57143 58083.19231 53497.5 53908
52495 52548.57143 58083.19231 53497.5 42890
64779 52548.57143 58083.19231 53497.5 37501
64779 52548.57143 58083.19231 53497.5 64936
74919 52548.57143 58083.19231 53497.5 33580
35424.5 52548.57143 58083.19231 53497.5 68582
35424.5 52548.57143 58083.19231 53497.5 34911
57106.6 52548.57143 58083.19231 53497.5 56194
57106.6 52548.57143 58083.19231 53497.5 57250
57106.6 52548.57143 58083.19231 53497.5 36804
57106.6 52548.57143 58083.19231 53497.5 48133
57106.6 52548.57143 58083.19231 53497.5 57877
53656 57681.83333 58083.19231 53497.5 54927
49928 57681.83333 58083.19231 53497.5 45930
35606 57681.83333 58083.19231 53497.5 78042
43763.5 57681.83333 58083.19231 53497.5 66828
43763.5 57681.83333 58083.19231 53497.5 47742
47273.75 57681.83333 58083.19231 53497.5 57319
47273.75 57681.83333 58083.19231 53497.5 74662
47273.75 57681.83333 58083.19231 53497.5 45176
47273.75 57681.83333 58083.19231 53497.5 48625
77168 57681.83333 58083.19231 53497.5 49464
34433 57681.83333 58083.19231 53497.5 37793
51612 57681.83333 58083.19231 53497.5 36054

Is there an inbuilt package I can leverage to achieve this for all the categorical columns that I have?

CodePudding user response:

try this:

avg_sale = df.groupby('Region').agg(avg_sale_region=('Sales', np.mean))
df.merge(avg_sale, on='Region')

This is for region, you can do the same for other columns such as city, ...

for col in columns_you_want: #columns_you_want=['City', 'Region', ...]
    avg_sale = df.groupby(col).agg(**{'avg_sale_by_' col: ('Sales', np.mean)})
    df = df.merge(avg_sale, on=col)

after that you can rename the columns:

df = df.drop(columns=columns_you_want)
df = df.rename(columns={'avg_sale_by_' col: col for col in columns_you_want})
  • Related