Home > Blockchain >  Replace NA in DataFrame for multiple columns with mean per country
Replace NA in DataFrame for multiple columns with mean per country

Time:12-13

I want to replace NA values with the mean of other column with the same year.

Note:
To replace NA values for Canada data, I want to use only the mean of Canada, not the mean from the whole dataset of course.

Here's a sample dataframe filled with random numbers. And some NA how i find them in my dataframe:

Country Inhabitants Year Area Cats Dogs
Canada 38 000 000 2021 4 32 21
Canada 37 000 000 2020 4 NA 21
Canada 36 000 000 2019 3 32 21
Canada NA 2018 2 32 21
Canada 34 000 000 2017 NA 32 21
Canada 35 000 000 2016 3 32 NA
Brazil 212 000 000 2021 5 32 21
Brazil 211 000 000 2020 4 NA 21
Brazil 210 000 000 2019 NA 32 21
Brazil 209 000 000 2018 4 32 21
Brazil NA 2017 2 32 21
Brazil 207 000 000 2016 4 32 NA

What's the easiest way with pandas to replace those NA with the mean values of the other years? And is it possible to write a code for which it is possible to go through every NA and replace them (Inhabitants, Area, Cats, Dogs at once)?

CodePudding user response:

Note Example is based on your additional data source from the comments

Replacing the NA-Values for multiple columns with mean() you can combine the following three methods:

  • fillna() (Iterating per column axis should be 0, which is default value of fillna())
  • groupby()
  • transform()

Create data frame from your example:
df = pd.read_excel('https://happiness-report.s3.amazonaws.com/2021/DataPanelWHR2021C2.xls')
Country name year Life Ladder Log GDP per capita Social support Healthy life expectancy at birth Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
Canada 2005 7.41805 10.6518 0.961552 71.3 0.957306 0.25623 0.502681 0.838544 0.233278
Canada 2007 7.48175 10.7392 nan 71.66 0.930341 0.249479 0.405608 0.871604 0.25681
Canada 2008 7.4856 10.7384 0.938707 71.84 0.926315 0.261585 0.369588 0.89022 0.202175
Canada 2009 7.48782 10.6972 0.942845 72.02 0.915058 0.246217 0.412622 0.867433 0.247633
Canada 2010 7.65035 10.7165 0.953765 72.2 0.933949 0.230451 0.41266 0.878868 0.233113

Call fillna() and iterate over all columns grouped by name of country:
df = df.fillna(df.groupby('Country name').transform('mean'))

Check your result for Canada:
df[df['Country name'] == 'Canada']
Country name year Life Ladder Log GDP per capita Social support Healthy life expectancy at birth Freedom to make life choices Generosity Perceptions of corruption Positive affect Negative affect
Canada 2005 7.41805 10.6518 0.961552 71.3 0.957306 0.25623 0.502681 0.838544 0.233278
Canada 2007 7.48175 10.7392 0.93547 71.66 0.930341 0.249479 0.405608 0.871604 0.25681
Canada 2008 7.4856 10.7384 0.938707 71.84 0.926315 0.261585 0.369588 0.89022 0.202175
Canada 2009 7.48782 10.6972 0.942845 72.02 0.915058 0.246217 0.412622 0.867433 0.247633
Canada 2010 7.65035 10.7165 0.953765 72.2 0.933949 0.230451 0.41266 0.878868 0.233113

CodePudding user response:

This also works:

In [2]:

df = pd.read_excel('DataPanelWHR2021C2.xls')

In [3]:

# Check for number of null values in df
df.isnull().sum()

Out [3]:

Country name                          0
year                                  0
Life Ladder                           0
Log GDP per capita                   36
Social support                       13
Healthy life expectancy at birth     55
Freedom to make life choices         32
Generosity                           89
Perceptions of corruption           110
Positive affect                      22
Negative affect                      16
dtype: int64

SOLUTION

In [4]:

# Adds mean of column to any NULL values
df.fillna(df.mean(), inplace=True)

In [5]:

# 2nd check for number of null values
df.isnull().sum()

Out [5]: No more NULL values

Country name                        0
year                                0
Life Ladder                         0
Log GDP per capita                  0
Social support                      0
Healthy life expectancy at birth    0
Freedom to make life choices        0
Generosity                          0
Perceptions of corruption           0
Positive affect                     0
Negative affect                     0
dtype: int64
  • Related