Add values to missing items 75% of the time

I am following a tutorial that presents the code below with no explanation whatsoever.

I am having difficulty understanding what it does, and your help is much appreciated.

import numpy as np

missing_rate = 0.75
n_missing_samples = int(np.floor(data.shape[0] * missing_rate))
missing_samples = np.hstack(
    (
        np.zeros(data.shape[0] - n_missing_samples, dtype=bool),
        np.ones(n_missing_samples, dtype=bool),
    )
)
rng = np.random.RandomState(0)
rng.shuffle(missing_samples)
missing_features = rng.randint(0, data.shape[1], n_missing_samples)
data.values[np.where(missing_samples)[0], missing_features] = np.nan

CodePudding user response：

Let's go through the code line by line, using mock data such as:

import numpy as np
import pandas as pd

# 10 rows x 5 columns of standard-normal values, rounded to one decimal
values = np.random.randn(10, 5)
data = pd.DataFrame(values, columns=['A','B','C','D','E'])
data = data.round(1)

This gives a data frame such as the following (the values are random, so yours will differ):

       A       B       C       D       E
0   -1.5    -0.0    -2.0     0.9    -1.5
1   -0.7    -1.8    -0.9     0.8     0.9
2    0.6    -0.7     3.2     0.8     0.6
3   -0.3    -0.4    -0.6    -0.4     0.1
4   -0.1    -0.9     0.4     0.0    -0.5
5   -0.5    -1.3     0.4     0.7     0.7
6    2.4     1.2     0.1     0.5     1.1
7    0.9     0.8    -0.5     1.7     2.1
8   -1.8     1.3    -2.3     0.5    -1.1
9   -0.8     1.4    -0.4     0.0     1.5

The first two lines set the fraction of rows that will each get one missing feature, and compute how many rows that is:

missing_rate = 0.75 
n_missing_samples = int(np.floor(data.shape[0] * missing_rate))

n_missing_samples evaluates to 7, since floor(10 * 0.75) = floor(7.5) = 7, which leaves 10 - 7 = 3 complete samples. The missing_samples block then concatenates three False and seven True values into a single 1-D boolean array:

array([False, False, False,  True,  True,  True,  True,  True,  True,
        True])
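
The arithmetic above can be checked in isolation. This is a minimal sketch that assumes a row count of 10 (the name n_rows is introduced here only for illustration and stands in for data.shape[0]):

```python
import numpy as np

n_rows = 10          # assumed: stands in for data.shape[0] of the mock data
missing_rate = 0.75

n_missing_samples = int(np.floor(n_rows * missing_rate))  # floor(7.5) -> 7
n_complete_samples = n_rows - n_missing_samples            # 10 - 7 -> 3

# Unshuffled mask: complete rows first (False), then rows to corrupt (True).
missing_samples = np.hstack(
    (
        np.zeros(n_complete_samples, dtype=bool),
        np.ones(n_missing_samples, dtype=bool),
    )
)

print(n_missing_samples, n_complete_samples)
print(missing_samples)
```

Note that np.floor rounds down, so with 10 rows the effective rate is 70%, not exactly 75%.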

This boolean array is shuffled next:

rng = np.random.RandomState(0)
rng.shuffle(missing_samples)

The boolean array becomes:

array([False,  True,  True,  True, False,  True,  True,  True, False,
        True])
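
A small sketch of what the shuffle does and does not change. It rebuilds the mask by hand (three False, seven True, as assumed from the mock data) and shows that shuffling only reorders the entries, so the number of rows to corrupt stays the same:

```python
import numpy as np

# Assumed starting mask: 3 complete rows followed by 7 rows to corrupt.
missing_samples = np.array([False] * 3 + [True] * 7)

rng = np.random.RandomState(0)   # fixed seed, so the order is reproducible
rng.shuffle(missing_samples)     # shuffles in place and returns None

# Integer positions of the True entries, i.e. the rows that will get a NaN.
rows_to_corrupt = np.where(missing_samples)[0]

print(missing_samples.sum())     # still 7
print(rows_to_corrupt)
```

Because the seed is fixed, rerunning the snippet selects the same rows each time.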

So, the rows marked True will each have one feature set to NaN. The code then randomly picks a column index for each of those rows (randint excludes the upper bound, so the indices run from 0 to data.shape[1] - 1):

missing_features = rng.randint(0, data.shape[1], n_missing_samples)
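
To see the shape and range of what randint returns, here is a standalone sketch. It assumes 5 columns and 7 rows to corrupt, and it creates a fresh generator, so the drawn values will not match the ones in this walkthrough (there the generator has already been advanced by the shuffle):

```python
import numpy as np

n_columns = 5            # assumed: columns A..E of the mock data
n_missing_samples = 7    # assumed: value computed earlier for 10 rows

rng = np.random.RandomState(0)
# One column index per row to corrupt; the upper bound is exclusive,
# so every value lies in 0..n_columns - 1.
missing_features = rng.randint(0, n_columns, n_missing_samples)

print(missing_features.shape)
print(missing_features.min() >= 0, missing_features.max() < n_columns)
```

The same column can be drawn for several rows, which is fine because each draw belongs to a different row.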

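Now we have the row and column indices of the cells to blank out. np.where(missing_samples)[0] gives the positions of the True rows, and pairing them element-wise with missing_features selects exactly one cell per row, which is set to NaN: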

data.values[np.where(missing_samples)[0], missing_features] = np.nan
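
The element-wise pairing is the part that is easiest to misread, so here is a tiny illustration on a plain NumPy array. The names grid, rows and cols are hypothetical and chosen only for this example:

```python
import numpy as np

grid = np.arange(20, dtype=float).reshape(4, 5)

rows = np.array([0, 2, 3])   # hypothetical row positions
cols = np.array([4, 0, 2])   # hypothetical column positions, one per row

# Two index arrays of equal length are paired element-wise:
# the cells touched are (0, 4), (2, 0) and (3, 2).
grid[rows, cols] = np.nan

print(np.isnan(grid).sum())          # 3 cells, not 3 x 3 = 9
print(np.argwhere(np.isnan(grid)))
```

This is why the original line produces one NaN per selected row rather than blanking a whole block of cells.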

The resultant data frame becomes:

       A       B       C       D       E
0   -1.5    -0.0    -2.0     0.9    -1.5
1   -0.7    -1.8    -0.9     0.8     NaN
2    NaN    -0.7     3.2     0.8     0.6
3    NaN    -0.4    -0.6    -0.4     0.1
4   -0.1    -0.9     0.4     0.0    -0.5
5   -0.5    -1.3     0.4     0.7     NaN
6    2.4     1.2     NaN     0.5     1.1
7    0.9     NaN    -0.5     1.7     2.1
8   -1.8     1.3    -2.3     0.5    -1.1
9    NaN     1.4    -0.4     0.0     1.5
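
One caveat: assigning through data.values only changes the data frame when pandas hands back a writable view of its underlying array, which is not guaranteed in every pandas version or for mixed column types. A possible alternative, sketched here under that assumption, works on an explicit NumPy copy and builds a new data frame. The function name add_missing_values and its parameters are made up for this example and are not part of the tutorial:

```python
import numpy as np
import pandas as pd


def add_missing_values(data, missing_rate=0.75, seed=0):
    """Return a copy of `data` with one NaN in floor(rate * n_rows) of its rows."""
    rng = np.random.RandomState(seed)
    n_rows, n_cols = data.shape
    n_missing = int(np.floor(n_rows * missing_rate))

    # Same layout as the tutorial: complete rows first, then rows to corrupt.
    mask = np.zeros(n_rows, dtype=bool)
    mask[n_rows - n_missing:] = True
    rng.shuffle(mask)

    # One random column per selected row.
    cols = rng.randint(0, n_cols, n_missing)

    # Work on an explicit float copy so the input frame is left untouched.
    values = data.to_numpy(dtype=float, copy=True)
    values[np.where(mask)[0], cols] = np.nan
    return pd.DataFrame(values, index=data.index, columns=data.columns)


# Mock data with a seeded generator (the seed 1 is arbitrary).
mock = pd.DataFrame(
    np.random.RandomState(1).randn(10, 5).round(1), columns=list("ABCDE")
)
result = add_missing_values(mock)

nan_per_row = result.isna().sum(axis=1)
print((nan_per_row == 1).sum())   # rows that received exactly one NaN
print(mock.isna().sum().sum())    # the original frame is unchanged
```

Returning a new frame instead of modifying the input also makes it easy to compare the data before and after.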