I have a dataframe df
:
City | Territory | Region | Area | Sales |
---|---|---|---|---|
Chicopee | Springfield MA | Northeast | National | 58761 |
Chicopee | Springfield MA | Northeast | National | 65204 |
Chicopee | Springfield MA | Northeast | National | 79862 |
Feeding Hills | Springfield MA | Northeast | National | 67247 |
Holyoke | Springfield MA | Northeast | National | 64347 |
Holyoke | Springfield MA | Northeast | National | 73473 |
Northampton | Springfield MA | Northeast | National | 43349 |
South Hadley | Springfield MA | Northeast | National | 43551 |
South Hadley | Springfield MA | Northeast | National | 73633 |
Springfield | Springfield MA | Northeast | National | 64025 |
Springfield | Springfield MA | Northeast | National | 56670 |
Springfield | Springfield MA | Northeast | National | 79424 |
Springfield | Springfield MA | Northeast | National | 39118 |
Springfield | Springfield MA | Northeast | National | 74262 |
Leominster | Worcester MA | Northeast | National | 39348 |
Lunenburg | Worcester MA | Northeast | National | 33119 |
Westborough | Worcester MA | Northeast | National | 54086 |
Worcester | Worcester MA | Northeast | National | 79859 |
Worcester | Worcester MA | Northeast | National | 40721 |
Framingham | Worcester MA | Northeast | National | 79197 |
Framingham | Worcester MA | Northeast | National | 36837 |
Framingham | Worcester MA | Northeast | National | 66552 |
Framingham | Worcester MA | Northeast | National | 66355 |
Concord | Worcester MA | Northeast | National | 78933 |
Holliston | Worcester MA | Northeast | National | 37515 |
Hudson | Worcester MA | Northeast | National | 49778 |
I want to impute the columns with average of Sales for the respective columns at the respective levels:
City | Territory_Name | Region | Area | Sales |
---|---|---|---|---|
72628 | 52548.57143 | 58083.19231 | 53497.5 | 67241 |
72628 | 52548.57143 | 58083.19231 | 53497.5 | 68107 |
72628 | 52548.57143 | 58083.19231 | 53497.5 | 53908 |
52495 | 52548.57143 | 58083.19231 | 53497.5 | 42890 |
64779 | 52548.57143 | 58083.19231 | 53497.5 | 37501 |
64779 | 52548.57143 | 58083.19231 | 53497.5 | 64936 |
74919 | 52548.57143 | 58083.19231 | 53497.5 | 33580 |
35424.5 | 52548.57143 | 58083.19231 | 53497.5 | 68582 |
35424.5 | 52548.57143 | 58083.19231 | 53497.5 | 34911 |
57106.6 | 52548.57143 | 58083.19231 | 53497.5 | 56194 |
57106.6 | 52548.57143 | 58083.19231 | 53497.5 | 57250 |
57106.6 | 52548.57143 | 58083.19231 | 53497.5 | 36804 |
57106.6 | 52548.57143 | 58083.19231 | 53497.5 | 48133 |
57106.6 | 52548.57143 | 58083.19231 | 53497.5 | 57877 |
53656 | 57681.83333 | 58083.19231 | 53497.5 | 54927 |
49928 | 57681.83333 | 58083.19231 | 53497.5 | 45930 |
35606 | 57681.83333 | 58083.19231 | 53497.5 | 78042 |
43763.5 | 57681.83333 | 58083.19231 | 53497.5 | 66828 |
43763.5 | 57681.83333 | 58083.19231 | 53497.5 | 47742 |
47273.75 | 57681.83333 | 58083.19231 | 53497.5 | 57319 |
47273.75 | 57681.83333 | 58083.19231 | 53497.5 | 74662 |
47273.75 | 57681.83333 | 58083.19231 | 53497.5 | 45176 |
47273.75 | 57681.83333 | 58083.19231 | 53497.5 | 48625 |
77168 | 57681.83333 | 58083.19231 | 53497.5 | 49464 |
34433 | 57681.83333 | 58083.19231 | 53497.5 | 37793 |
51612 | 57681.83333 | 58083.19231 | 53497.5 | 36054 |
Is there an inbuilt package I can leverage to achieve this for all the categorical columns that I have?
CodePudding user response:
try this:
avg_sale = df.groupby('Region').agg(avg_sale_region=('Sales', np.mean))
df.merge(avg_sale, on='Region')
This is for region, you can do the same for other columns such as city, ...
for col in columns_you_want: #columns_you_want=['City', 'Region', ...]
avg_sale = df.groupby(col).agg(**{'avg_sale_by_' col: ('Sales', np.mean)})
df = df.merge(avg_sale, on=col)
after that you can rename the columns:
df = df.drop(columns=columns_you_want)
df = df.rename(columns={'avg_sale_by_' col: col for col in columns_you_want})