Replace NA in DataFrame for multiple columns with mean per country-CodePudding

I want to replace NA values with the mean of other column with the same year.

Note:
To replace NA values for Canada data, I want to use only the mean of Canada, not the mean from the whole dataset of course.

Here's a sample dataframe filled with random numbers. And some NA how i find them in my dataframe:

Country	Inhabitants	Year	Area	Cats	Dogs
Canada	38 000 000	2021	4	32	21
Canada	37 000 000	2020	4	NA	21
Canada	36 000 000	2019	3	32	21
Canada	NA	2018	2	32	21
Canada	34 000 000	2017	NA	32	21
Canada	35 000 000	2016	3	32	NA
Brazil	212 000 000	2021	5	32	21
Brazil	211 000 000	2020	4	NA	21
Brazil	210 000 000	2019	NA	32	21
Brazil	209 000 000	2018	4	32	21
Brazil	NA	2017	2	32	21
Brazil	207 000 000	2016	4	32	NA

What's the easiest way with pandas to replace those NA with the mean values of the other years? And is it possible to write a code for which it is possible to go through every NA and replace them (Inhabitants, Area, Cats, Dogs at once)?

CodePudding user response：

Note Example is based on your additional data source from the comments

Replacing the NA-Values for multiple columns with mean() you can combine the following three methods:

fillna() (Iterating per column axis should be 0, which is default value of fillna())
groupby()
transform()

Create data frame from your example:

df = pd.read_excel('https://happiness-report.s3.amazonaws.com/2021/DataPanelWHR2021C2.xls')

Country name	year	Life Ladder	Log GDP per capita	Social support	Healthy life expectancy at birth	Freedom to make life choices	Generosity	Perceptions of corruption	Positive affect	Negative affect
Canada	2005	7.41805	10.6518	0.961552	71.3	0.957306	0.25623	0.502681	0.838544	0.233278
Canada	2007	7.48175	10.7392	nan	71.66	0.930341	0.249479	0.405608	0.871604	0.25681
Canada	2008	7.4856	10.7384	0.938707	71.84	0.926315	0.261585	0.369588	0.89022	0.202175
Canada	2009	7.48782	10.6972	0.942845	72.02	0.915058	0.246217	0.412622	0.867433	0.247633
Canada	2010	7.65035	10.7165	0.953765	72.2	0.933949	0.230451	0.41266	0.878868	0.233113

Call `fillna()` and iterate over all columns grouped by name of country:

df = df.fillna(df.groupby('Country name').transform('mean'))

Check your result for Canada:

df[df['Country name'] == 'Canada']

Country name	year	Life Ladder	Log GDP per capita	Social support	Healthy life expectancy at birth	Freedom to make life choices	Generosity	Perceptions of corruption	Positive affect	Negative affect
Canada	2005	7.41805	10.6518	0.961552	71.3	0.957306	0.25623	0.502681	0.838544	0.233278
Canada	2007	7.48175	10.7392	0.93547	71.66	0.930341	0.249479	0.405608	0.871604	0.25681
Canada	2008	7.4856	10.7384	0.938707	71.84	0.926315	0.261585	0.369588	0.89022	0.202175
Canada	2009	7.48782	10.6972	0.942845	72.02	0.915058	0.246217	0.412622	0.867433	0.247633
Canada	2010	7.65035	10.7165	0.953765	72.2	0.933949	0.230451	0.41266	0.878868	0.233113

CodePudding user response：

This also works:

In [2]:

df = pd.read_excel('DataPanelWHR2021C2.xls')

In [3]:

# Check for number of null values in df
df.isnull().sum()

Out [3]:

Country name                          0
year                                  0
Life Ladder                           0
Log GDP per capita                   36
Social support                       13
Healthy life expectancy at birth     55
Freedom to make life choices         32
Generosity                           89
Perceptions of corruption           110
Positive affect                      22
Negative affect                      16
dtype: int64

SOLUTION

In [4]:

# Adds mean of column to any NULL values
df.fillna(df.mean(), inplace=True)

In [5]:

# 2nd check for number of null values
df.isnull().sum()

Out [5]: No more NULL values

Country name                        0
year                                0
Life Ladder                         0
Log GDP per capita                  0
Social support                      0
Healthy life expectancy at birth    0
Freedom to make life choices        0
Generosity                          0
Perceptions of corruption           0
Positive affect                     0
Negative affect                     0
dtype: int64

Create data frame from your example:

Call fillna() and iterate over all columns grouped by name of country:

Check your result for Canada:

Call `fillna()` and iterate over all columns grouped by name of country: