I want to replace NA values with the mean of other column with the same year.
Note:
To replace NA values for Canada data, I want to use only the mean of Canada, not the mean from the whole dataset of course.
Here's a sample dataframe filled with random numbers. And some NA how i find them in my dataframe:
Country | Inhabitants | Year | Area | Cats | Dogs |
---|---|---|---|---|---|
Canada | 38 000 000 | 2021 | 4 | 32 | 21 |
Canada | 37 000 000 | 2020 | 4 | NA | 21 |
Canada | 36 000 000 | 2019 | 3 | 32 | 21 |
Canada | NA | 2018 | 2 | 32 | 21 |
Canada | 34 000 000 | 2017 | NA | 32 | 21 |
Canada | 35 000 000 | 2016 | 3 | 32 | NA |
Brazil | 212 000 000 | 2021 | 5 | 32 | 21 |
Brazil | 211 000 000 | 2020 | 4 | NA | 21 |
Brazil | 210 000 000 | 2019 | NA | 32 | 21 |
Brazil | 209 000 000 | 2018 | 4 | 32 | 21 |
Brazil | NA | 2017 | 2 | 32 | 21 |
Brazil | 207 000 000 | 2016 | 4 | 32 | NA |
What's the easiest way with pandas to replace those NA with the mean values of the other years? And is it possible to write a code for which it is possible to go through every NA and replace them (Inhabitants, Area, Cats, Dogs at once)?
CodePudding user response:
Note Example is based on your additional data source from the comments
Replacing the NA-Values for multiple columns with mean()
you can combine the following three methods:
fillna()
(Iterating per columnaxis
should be 0, which is default value offillna()
)groupby()
transform()
Create data frame from your example:
df = pd.read_excel('https://happiness-report.s3.amazonaws.com/2021/DataPanelWHR2021C2.xls')
Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
---|---|---|---|---|---|---|---|---|---|---|
Canada | 2005 | 7.41805 | 10.6518 | 0.961552 | 71.3 | 0.957306 | 0.25623 | 0.502681 | 0.838544 | 0.233278 |
Canada | 2007 | 7.48175 | 10.7392 | nan | 71.66 | 0.930341 | 0.249479 | 0.405608 | 0.871604 | 0.25681 |
Canada | 2008 | 7.4856 | 10.7384 | 0.938707 | 71.84 | 0.926315 | 0.261585 | 0.369588 | 0.89022 | 0.202175 |
Canada | 2009 | 7.48782 | 10.6972 | 0.942845 | 72.02 | 0.915058 | 0.246217 | 0.412622 | 0.867433 | 0.247633 |
Canada | 2010 | 7.65035 | 10.7165 | 0.953765 | 72.2 | 0.933949 | 0.230451 | 0.41266 | 0.878868 | 0.233113 |
Call fillna()
and iterate over all columns grouped by name of country:
df = df.fillna(df.groupby('Country name').transform('mean'))
Check your result for Canada:
df[df['Country name'] == 'Canada']
Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
---|---|---|---|---|---|---|---|---|---|---|
Canada | 2005 | 7.41805 | 10.6518 | 0.961552 | 71.3 | 0.957306 | 0.25623 | 0.502681 | 0.838544 | 0.233278 |
Canada | 2007 | 7.48175 | 10.7392 | 0.93547 | 71.66 | 0.930341 | 0.249479 | 0.405608 | 0.871604 | 0.25681 |
Canada | 2008 | 7.4856 | 10.7384 | 0.938707 | 71.84 | 0.926315 | 0.261585 | 0.369588 | 0.89022 | 0.202175 |
Canada | 2009 | 7.48782 | 10.6972 | 0.942845 | 72.02 | 0.915058 | 0.246217 | 0.412622 | 0.867433 | 0.247633 |
Canada | 2010 | 7.65035 | 10.7165 | 0.953765 | 72.2 | 0.933949 | 0.230451 | 0.41266 | 0.878868 | 0.233113 |
CodePudding user response:
This also works:
In [2]:
df = pd.read_excel('DataPanelWHR2021C2.xls')
In [3]:
# Check for number of null values in df
df.isnull().sum()
Out [3]:
Country name 0
year 0
Life Ladder 0
Log GDP per capita 36
Social support 13
Healthy life expectancy at birth 55
Freedom to make life choices 32
Generosity 89
Perceptions of corruption 110
Positive affect 22
Negative affect 16
dtype: int64
SOLUTION
In [4]:
# Adds mean of column to any NULL values
df.fillna(df.mean(), inplace=True)
In [5]:
# 2nd check for number of null values
df.isnull().sum()
Out [5]: No more NULL values
Country name 0
year 0
Life Ladder 0
Log GDP per capita 0
Social support 0
Healthy life expectancy at birth 0
Freedom to make life choices 0
Generosity 0
Perceptions of corruption 0
Positive affect 0
Negative affect 0
dtype: int64