How to replace NaN values in cells with a different random value

I'm trying to fill empty cells from a single column (and then from multiple columns) using the standard deviation. I have seen several explanations but so far haven't seen any that is straightforward and pertinent to this question. Thank you in advance for your time!

Here is what I have so far:

To obtain random values, I first calculated the mean for the Math column:

filename = os.path.join(os.path.dirname(__file__),'exam.csv')
data = pd.read_csv(filename)

math_mean = data['Math'].mean()
print(math_mean)
The output of that is 73.625

Then, I calculated the Standard deviation:

math_std = data['Math'].std()
print(math_std)

Which gave me 10.14097064951308 for STD.

73 - 10 = 63
73 + 10 = 83

Based on the calculation above, I need to generate random values between 63 and 83 and fill in the empty cells of the Math column. I used np.random.randint() to get the random values:

random_grades = np.random.randint(63,83)
print(random_grades)

So everything is fine so far, and the random values do get inserted into the empty cells. However, it's the same value repeating over and over.

I'm using this syntax:

#Producing random grade values
random_grades = np.random.randint(63.48, 83.76)
print(random_grades)

#Storing new random grades to variable
new_grades = data.Math.fillna(random_grades)
print(new_grades)

#Passing new values to table
data['Math'].fillna(new_grades,inplace=True)
print(data)

This is the output:

0     79.0
1     82.0
2     85.0
3     77.0 # random value
4     70.0
5     77.0 # same random value
6     84.0
7     67.0
8     77.0 # same random value
9     63.0
10    59.0

If I try to add the parameter size= in random_grades = np.random.randint(63.48, 83.76, size=5) I get an error:

Traceback (most recent call last):
  File "c:/Users/Desktop/tiny.py", line 88, in <module>
    new_grades = data.Math.fillna(random_grades)
  File "C:\Users\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pandas\core\series.py", line 4433, in fillna
    return super().fillna(
  File "C:\Users\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pandas\core\generic.py", line 6397, in fillna
    raise TypeError(
TypeError: "value" parameter must be a scalar, dict or Series, but you passed a "ndarray"

What am I missing?

CodePudding user response：

The first idea is to test for missing values and generate exactly as many random values as there are missing cells. The count is the sum of the boolean mask (True = 1, False = 0), passed as the size argument of numpy.random.randint:

mean = data['Math'].mean()
std = data['Math'].std()

mask = data['Math'].isna()
data.loc[mask, 'Math'] = np.random.randint(mean - std, mean + std, mask.sum())
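
For a version that runs without exam.csv, here is a minimal sketch of the same mask approach. The sample values are taken from the output printed in the question, with NaN placed at the three rows marked as random; casting the bounds to int is an assumption added here so that randint receives whole numbers.

```python
import numpy as np
import pandas as pd

# Values from the question's printed output; NaN at the rows marked "random value"
data = pd.DataFrame({'Math': [79.0, 82.0, 85.0, np.nan, 70.0, np.nan,
                              84.0, 67.0, np.nan, 63.0, 59.0]})

mean = data['Math'].mean()   # NaN values are skipped by default
std = data['Math'].std()

# Integer bounds one standard deviation around the mean
low, high = int(mean - std), int(mean + std)

# Boolean mask of the missing cells; its sum is the number of NaN values
mask = data['Math'].isna()

# One independent draw per missing cell (the upper bound is exclusive)
data.loc[mask, 'Math'] = np.random.randint(low, high, mask.sum())
print(data)
```

Because the array has exactly mask.sum() elements, each missing row gets its own draw, though two draws can still happen to be equal by chance.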

Your solution is similar; the only change needed is to convert the NumPy array to a Series indexed by the rows with missing values, so that fillna can align each random value to its cell:

mean = data['Math'].mean()
std = data['Math'].std()

#Testing missing values
mask = data['Math'].isna()

#Producing random grade values
random_grades = np.random.randint(mean - std, mean + std, mask.sum())
print(random_grades)

#convert array to Series and replace missing values
data['Math'] = data.Math.fillna(pd.Series(random_grades, index=data.index[mask]))
print(data)
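
Since the question also mentions filling several columns, the same idea could be wrapped in a loop. This is only a sketch under assumptions: the 'Physics' column and all values below are invented for illustration, and the bounds are cast to int before being passed to randint.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column table; 'Physics' and all values are invented for the demo
data = pd.DataFrame({
    'Math': [79.0, np.nan, 85.0, 70.0, np.nan, 63.0],
    'Physics': [np.nan, 55.0, 61.0, np.nan, 72.0, 68.0],
})

for col in ['Math', 'Physics']:
    mean, std = data[col].mean(), data[col].std()
    mask = data[col].isna()
    # Draw one value per missing cell in this column
    random_values = np.random.randint(int(mean - std), int(mean + std), mask.sum())
    # Index the draws by the missing rows so fillna aligns them correctly
    data[col] = data[col].fillna(pd.Series(random_values, index=data.index[mask]))

print(data)
```

Each column uses its own mean and standard deviation, so the random range differs per column.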

CodePudding user response：

Try using apply so that a separate random value is drawn for each missing cell:

df.Math = df.Math.apply(lambda x: np.random.randint(63.48, 83.76) if pd.isna(x) else x)
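
A self-contained sketch of this apply approach follows. The DataFrame values are made up for the demo, and the bounds are written as the integers 63 and 83 rather than the floats above.

```python
import numpy as np
import pandas as pd

# Made-up values for the demo
df = pd.DataFrame({'Math': [79.0, np.nan, 85.0, np.nan, 70.0]})

# The lambda runs once per cell, so every NaN triggers its own randint call
df['Math'] = df['Math'].apply(
    lambda x: np.random.randint(63, 83) if pd.isna(x) else x
)
print(df)
```

This calls randint once per row, which is simple to read but likely slower on large tables than generating all the values in a single call.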

CodePudding user response：

np.random.randint(63.48, 83.76)

generates a single value, whereas

np.random.randint(63.48, 83.76, 5)

generates 5 random values within the limits specified.
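
The difference can be checked directly. This small sketch uses integer bounds (63 and 83) instead of the floats above:

```python
import numpy as np

single = np.random.randint(63, 83)       # one integer
several = np.random.randint(63, 83, 5)   # NumPy array of 5 integers

print(np.ndim(single))   # 0, i.e. a scalar
print(several.shape)     # (5,)
```

As the traceback in the question shows, fillna rejects a bare ndarray, so the array still has to be wrapped in a Series (or assigned through a boolean mask) before it can fill the missing cells.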