function to replace null values with mean

Time:12-11

I have unemployment data for 30 countries with some missing values. In the Excel sheet all the numbers are stored as strings, so I first convert them to floats; then, wherever a cell is empty, I want to replace it with the mean of its column. My code runs without any error, but when I print the data I still have the null values.

data=pd.read_excel(r'C:\Users\OĞUZ\Desktop\employment.xlsx')
data=data.set_index('Unnamed: 0')
for column in data:
    for row in column:
        if len(row)>5:
            row=float(row)
            if row.isnull():
                row=column.mean()
print(data['Argentina'].head())

This is what I get after print.

Unnamed: 0
1990   NaN
1991   NaN
1992   NaN
1993   NaN
1994   NaN
Name: Argentina, dtype: float64

CodePudding user response:

To fill NaNs use df.fillna(value). For the mean use df.mean(). If your column is named Argentina this could look like below:

df.Argentina.fillna(df.Argentina.mean(), inplace=True)

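The inplace=True asks pandas to modify the column in place instead of returning a new Series. Calling it on a column selected from a DataFrame is a chained operation, though: recent pandas versions (2.x) warn about it, and with Copy-on-Write enabled it no longer updates df. The explicit assignment below is the intended equivalent and the safer form: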

df.Argentina = df.Argentina.fillna(df.Argentina.mean())
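As for why the loop in the question runs without error yet changes nothing: iterating over a DataFrame yields its column labels, and iterating over a label yields single characters, so the len(row)>5 test never passes and the loop body never executes. Even if it did, rebinding row would not write anything back into data. A small sketch with made-up values (the numbers below are not from the asker's spreadsheet):

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
data = pd.DataFrame({"Argentina": ["7.5", np.nan, "6.9"]}, index=[1990, 1991, 1992])

# Iterating over a DataFrame yields column labels, not the columns themselves
labels = [column for column in data]

# Iterating over a label (a plain string) yields its individual characters
chars = [row for row in labels[0]]

# Every "row" is a single character, so len(row) > 5 is never true
body_ran = any(len(row) > 5 for row in chars)

print(labels)     # ['Argentina']
print(chars[:3])  # ['A', 'r', 'g']
print(body_ran)   # False
```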

Example

import numpy as np
import pandas as pd

df = pd.DataFrame({'Argentina':[1,np.nan,2,4]}, index=[1990, 1991, 1992, 1993])
>>> df
        Argentina
1990    1.0
1991    NaN
1992    2.0
1993    4.0

df.Argentina = df.Argentina.fillna(df.Argentina.mean())

>>> df
        Argentina
1990    1.000000
1991    2.333333
1992    2.000000
1993    4.000000
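
The same idea extends to every column at once, because DataFrame.fillna accepts a Series of per-column fill values and DataFrame.mean returns exactly that. A minimal sketch, assuming the columns are already numeric; the Brazil column and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Argentina": [1, np.nan, 2, 4], "Brazil": [np.nan, 3, 5, np.nan]},
    index=[1990, 1991, 1992, 1993],
)

# df.mean() is a Series indexed by column name; fillna matches each mean
# to its own column and leaves the non-missing values untouched
filled = df.fillna(df.mean())
print(filled)
```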

CodePudding user response:

You can either iterate over the columns, or use DataFrame.transform or DataFrame.apply.

Whichever approach you use, you'll want to:

  1. Convert column values from strings to floats
  2. Calculate the mean of the column
  3. Use Series.fillna to fill the NaN values with the previously calculated mean
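
Written as a plain loop over the columns, the three steps above might look like the sketch below. The sample frame and its column names x and y are made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up sample: numbers stored as strings, with some values missing
df = pd.DataFrame({"x": ["1", np.nan, "3"], "y": ["10", "20", np.nan]})

cleaned = df.copy()
for name in cleaned.columns:
    # 1. convert the column from strings to floats
    numeric = pd.to_numeric(cleaned[name])
    # 2. compute the column mean (NaN values are skipped by default)
    avg = numeric.mean()
    # 3. fill the gaps and assign the result back to the DataFrame
    cleaned[name] = numeric.fillna(avg)

print(cleaned)
```

Assigning back with cleaned[name] = ... is what makes the change stick; rebinding a loop variable alone would leave the DataFrame untouched.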

Create Data

import pandas as pd
import numpy as np
rng = np.random.default_rng(0)

df = pd.DataFrame({
    "a": rng.integers(5, size=10),
    "b": rng.integers(5, 10, size=10),
    "c": rng.integers(10, 15, size=10)
}).astype(str)
df.loc[2:5, :] = np.nan

# note all the numbers you see are actually strings
print(df)
     a    b    c
0    4    8   11
1    3    9   14
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN
5  NaN  NaN  NaN
6    0    8   12
7    0    7   10
8    0    7   13
9    4    9   13

Solution - DataFrame transform

def clean_column(series):
    series = pd.to_numeric(series, downcast="float")
    avg = series.mean()
    return series.fillna(avg)

new_df = df.transform(clean_column)

print(new_df)
          a    b          c
0  4.000000  8.0  11.000000
1  3.000000  9.0  14.000000
2  1.833333  8.0  12.166667
3  1.833333  8.0  12.166667
4  1.833333  8.0  12.166667
5  1.833333  8.0  12.166667
6  0.000000  8.0  12.000000
7  0.000000  7.0  10.000000
8  0.000000  7.0  13.000000
9  4.000000  9.0  13.000000
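
DataFrame.apply, mentioned above as an alternative, works with the same kind of per-column function. The sketch below also assumes the spreadsheet might contain non-numeric placeholders such as ".." (the question does not show the raw cells, so this is a guess); pd.to_numeric raises on such strings by default, and errors="coerce" turns them into NaN so they get filled with the mean as well:

```python
import pandas as pd

# Hypothetical raw cells: numeric strings mixed with text placeholders
raw = pd.DataFrame({"a": ["4", "..", "2"], "b": ["unknown", "8", "6"]})

def clean_column(series):
    # Unparsable strings become NaN instead of raising ValueError
    numeric = pd.to_numeric(series, errors="coerce")
    # Fill every NaN with the mean of the values that did parse
    return numeric.fillna(numeric.mean())

result = raw.apply(clean_column)
print(result)
```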