I am following a tutorial, but it presents this code with no explanation whatsoever.
I am having difficulty understanding what is happening here, and your help is much appreciated.
import numpy as np
missing_rate = 0.75
n_missing_samples = int(np.floor(data.shape[0] * missing_rate))
missing_samples = np.hstack(
(
np.zeros(data.shape[0] - n_missing_samples, dtype=bool),
np.ones(n_missing_samples, dtype=bool),
)
)
rng = np.random.RandomState(0)
rng.shuffle(missing_samples)
missing_features = rng.randint(0, data.shape[1], n_missing_samples)
data.values[np.where(missing_samples)[0], missing_features] = np.nan
CodePudding user response:
Let's go through the code line by line, using mock data such as:
import numpy as np
import pandas as pd
values = np.random.randn(10, 5)  # 10 rows, 5 columns of standard normal values
data = pd.DataFrame(values, columns=['A', 'B', 'C', 'D', 'E'])
data = data.round(1)
This gives a data frame such as the following (the values are random, so yours will differ):
A B C D E
0 -1.5 -0.0 -2.0 0.9 -1.5
1 -0.7 -1.8 -0.9 0.8 0.9
2 0.6 -0.7 3.2 0.8 0.6
3 -0.3 -0.4 -0.6 -0.4 0.1
4 -0.1 -0.9 0.4 0.0 -0.5
5 -0.5 -1.3 0.4 0.7 0.7
6 2.4 1.2 0.1 0.5 1.1
7 0.9 0.8 -0.5 1.7 2.1
8 -1.8 1.3 -2.3 0.5 -1.1
9 -0.8 1.4 -0.4 0.0 1.5
The first two lines set the fraction of rows that will each receive one missing value, and convert that fraction into a row count:
missing_rate = 0.75
n_missing_samples = int(np.floor(data.shape[0] * missing_rate))
The value of n_missing_samples is 7, because floor(10 * 0.75) = floor(7.5) = 7 (so the number of complete samples is 10 - 7 = 3). Then the missing_samples block horizontally stacks three False and seven True values:
array([False, False, False, True, True, True, True, True, True,
True])
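As a small standalone sketch of this step, the mask can be built without the data frame at all. The names n_rows and n_complete_samples are introduced here for illustration only, and n_rows = 10 is assumed to match the mock data:

```python
import numpy as np

n_rows = 10          # assumed: number of rows in the mock data frame
missing_rate = 0.75

# floor(10 * 0.75) = floor(7.5) = 7 rows will get a missing value
n_missing_samples = int(np.floor(n_rows * missing_rate))
n_complete_samples = n_rows - n_missing_samples  # 3 rows stay complete

# Concatenate a block of False values followed by a block of True values
missing_samples = np.hstack(
    (
        np.zeros(n_complete_samples, dtype=bool),
        np.ones(n_missing_samples, dtype=bool),
    )
)
print(missing_samples)
```

At this point the True values all sit at the end of the array, so only the last seven rows would be affected if the mask were used as is.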
This boolean array is shuffled next:
rng = np.random.RandomState(0)
rng.shuffle(missing_samples)
The boolean array becomes:
array([False, True, True, True, False, True, True, True, False,
True])
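A quick way to convince yourself of what the shuffle does: it reorders the array in place, so the number of True values stays the same and only their positions change. The sketch below rebuilds the mask by hand (the variable names mask and rows are illustrative) and checks that:

```python
import numpy as np

# Same starting mask as above: three False followed by seven True
mask = np.array([False] * 3 + [True] * 7)

rng = np.random.RandomState(0)  # fixed seed, so the shuffle is reproducible
rng.shuffle(mask)               # shuffles in place and returns None

# np.where on a 1-D boolean array returns a tuple; [0] holds the True positions
rows = np.where(mask)[0]

print(mask.sum())  # still 7 True values
print(rows)        # indices of the rows that will receive a NaN
```

Because the seed is fixed, running the code again yields the same row selection each time.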
So the rows marked True will each have one feature set to NaN. Next, the code randomly chooses a column index for each of those rows:
missing_features = rng.randint(0, data.shape[1], n_missing_samples)
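To illustrate the call, here is a minimal sketch with the shape values written out (n_columns = 5 and n_missing_samples = 7 are assumed from the mock data). Note that the upper bound of randint is exclusive, so the results are valid column positions 0 to 4, and the same column may be drawn for several rows:

```python
import numpy as np

rng = np.random.RandomState(0)

n_columns = 5          # assumed: data.shape[1] for the mock data
n_missing_samples = 7  # assumed: one draw per row that gets a NaN

# Draw 7 integers from the half-open interval [0, 5)
missing_features = rng.randint(0, n_columns, n_missing_samples)

print(missing_features.shape)                         # (7,)
print(missing_features.min(), missing_features.max()) # both within 0..4
```

The exact numbers here will not match the tutorial's, because in the original code the generator has already been advanced by the shuffle before randint is called.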
Now we have one row index and one column index per missing value, and the cells at those (row, column) positions are set to NaN:
data.values[np.where(missing_samples)[0], missing_features] = np.nan
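The key point is that indexing a NumPy array with two integer arrays of equal length pairs them element by element, rather than selecting every combination of rows and columns. A toy sketch (arr, rows and cols are made-up names and values):

```python
import numpy as np

arr = np.arange(12, dtype=float).reshape(4, 3)

rows = np.array([0, 2, 3])
cols = np.array([1, 1, 0])

# Pairs up as (0, 1), (2, 1), (3, 0): exactly three cells are modified
arr[rows, cols] = np.nan

print(arr)
```

Applied to the mock data, the same kind of assignment touches seven cells, one per selected row.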
The resulting data frame is:
A B C D E
0 -1.5 -0.0 -2.0 0.9 -1.5
1 -0.7 -1.8 -0.9 0.8 NaN
2 NaN -0.7 3.2 0.8 0.6
3 NaN -0.4 -0.6 -0.4 0.1
4 -0.1 -0.9 0.4 0.0 -0.5
5 -0.5 -1.3 0.4 0.7 NaN
6 2.4 1.2 NaN 0.5 1.1
7 0.9 NaN -0.5 1.7 2.1
8 -1.8 1.3 -2.3 0.5 -1.1
9 NaN 1.4 -0.4 0.0 1.5
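Putting the steps together, here is one possible way to wrap them in a function. This is a sketch, not the tutorial's code: add_missing_values and its seed parameter are hypothetical names, and it assumes every column is numeric. It also works on a copy of the values instead of writing through data.values, since that assignment relies on .values returning a writable view of the frame's data, which is not guaranteed (for example with mixed dtypes, or with pandas Copy-on-Write enabled).

```python
import numpy as np
import pandas as pd


def add_missing_values(data, missing_rate=0.75, seed=0):
    """Return a copy of `data` with one NaN in a random column of
    floor(n_rows * missing_rate) randomly chosen rows."""
    rng = np.random.RandomState(seed)
    n_rows, n_cols = data.shape
    n_missing = int(np.floor(n_rows * missing_rate))

    # Boolean row mask with exactly n_missing True values, then shuffled
    mask = np.hstack(
        (
            np.zeros(n_rows - n_missing, dtype=bool),
            np.ones(n_missing, dtype=bool),
        )
    )
    rng.shuffle(mask)

    # One random column position per selected row
    cols = rng.randint(0, n_cols, n_missing)

    # Work on an explicit float copy so the input frame is left untouched
    values = data.to_numpy(dtype=float, copy=True)
    values[np.where(mask)[0], cols] = np.nan
    return pd.DataFrame(values, index=data.index, columns=data.columns)


# Mock data, seeded here so the example is reproducible
data = pd.DataFrame(
    np.random.RandomState(1).randn(10, 5), columns=list("ABCDE")
).round(1)
result = add_missing_values(data)

print(result.isna().sum().sum())  # total number of NaN cells
```

Each selected row gets exactly one NaN, so the total number of missing cells equals the number of selected rows, and the original data frame is not modified.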