How do I fill NaN values with different random numbers on Python?

Time:04-02

I want to replace the missing values in a column of people's ages (the column contains numerical values as well as NaN), but everything I've tried so far either doesn't work the way I want or doesn't work at all.

I want to fill them using a random number generator that follows a normal distribution, with the mean and standard deviation computed from that column.

I have tried the following:

  • Using replace with np.nan: this replaces the NaN values, but with the same number for all of them

    df_travel['Age'] = df_travel['Age'].replace(np.nan, round(rd.normalvariate(age_mean, age_std),0))
    
  • Using fillna with pandas: this also replaces the NaN values, but with the same number for all of them

    df_travel['Age'] = df_travel['Age'].fillna(round(rd.normalvariate(age_mean, age_std),0))
    
  • Applying a function to the column with pandas: this replaces the NaN values but also overwrites all existing numerical values (I only want to fill the NaN values)

    df_travel['Age'] = df_travel['Age'].where(df_travel['Age'].isnull() == True).apply(lambda v: round(rd.normalvariate(age_mean, age_std),0))
    

Any ideas would be appreciated. Thanks in advance.

CodePudding user response:

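Series.fillna can accept a Series and fills each NaN with the value at the matching index, leaving existing values alone. So generate a random array with one value per row (size len(df_travel)) and pass it in: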

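import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the result is reproducible
mu = df_travel['Age'].mean()
sd = df_travel['Age'].std()

# one draw per row; share df_travel's index so fillna aligns row by row
filler = pd.Series(rng.normal(loc=mu, scale=sd, size=len(df_travel)), index=df_travel.index)
df_travel['Age'] = df_travel['Age'].fillna(filler)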
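
A quick way to convince yourself that this only touches the missing entries is to run it on a small made-up frame. The toy `df_travel` below and its letter index are assumptions for illustration, not the asker's data. The index is deliberately not the default 0..n-1, because fillna aligns on index labels, so the filler Series has to carry the same index. Rounding to whole years follows the question.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real frame (hypothetical values).
# The letter index is deliberate: fillna matches rows by index label.
df_travel = pd.DataFrame(
    {"Age": [22.0, np.nan, 35.0, np.nan, 41.0, np.nan]},
    index=list("abcdef"),
)

rng = np.random.default_rng(0)
mu = df_travel["Age"].mean()  # NaN is skipped by default
sd = df_travel["Age"].std()

# One independent draw per row, rounded to whole years as in the question.
filler = pd.Series(
    rng.normal(loc=mu, scale=sd, size=len(df_travel)),
    index=df_travel.index,
).round(0)

was_missing = df_travel["Age"].isna()
filled = df_travel["Age"].fillna(filler)

print(filled)
# Existing ages are unchanged; only the rows flagged in was_missing differ.
print((filled[~was_missing] == df_travel["Age"][~was_missing]).all())
```

The draws generated for rows that already had an age are simply discarded, which is wasteful for huge frames but keeps the code short.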

CodePudding user response:

I would do it the following way:

import numpy as np

# compute mean and std of `Age` (NaN values are skipped by default)
age_mean = df['Age'].mean()
age_std = df['Age'].std()

# number of NaN in `Age` column
num_na = df['Age'].isna().sum()

# generate `num_na` samples from N(age_mean, age_std**2) distribution
rand_vals = age_mean + age_std * np.random.randn(num_na)

# replace missing values with `rand_vals`
df.loc[df['Age'].isna(), 'Age'] = rand_vals
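
The mask-based approach can also be wrapped in a small function, sketched here with rounding to whole years as in the question. `fill_na_normal` is a hypothetical helper name, and the frame below is made-up data for illustration.

```python
import numpy as np
import pandas as pd


def fill_na_normal(s, rng, decimals=0):
    """Return a copy of `s` with each NaN replaced by its own N(mean, std**2) draw."""
    out = s.copy()
    mask = out.isna()
    # Draw exactly as many values as there are missing entries.
    draws = rng.normal(loc=s.mean(), scale=s.std(), size=int(mask.sum()))
    out.loc[mask] = np.round(draws, decimals)
    return out


# Made-up example data (hypothetical values).
df = pd.DataFrame({"Age": [22.0, np.nan, 35.0, np.nan, 41.0]})
original = df["Age"].copy()

df["Age"] = fill_na_normal(df["Age"], np.random.default_rng(42))
print(df)
```

Because a normal distribution is unbounded, a draw can occasionally land below zero when the standard deviation is large relative to the mean, so it may be worth checking the filled values afterwards.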