What does np.mean(data.isnull()) do exactly?

While building a data-cleaning project in Python, I found this code:

# let's see if there is any missing data

for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing,2)))

Which actually works fine, giving back the % of null values per column in the dataframe, but I'm a little confused on how it works:

First we loop over each column in the dataframe, then we take that mean, but the mean of what exactly? Is it the mean, for each column, of the number of null cells, or something else?

Just for reference, I've worked around it with this:

NullValues = df.isnull().sum() / len(df)
print(NullValues.round(2))

That gives me basically the same results, but I'd like to understand the mechanism: I'm confused about the first block of code.

CodePudding user response：

df[col].isnull() returns a boolean (True/False) for each value in the column, depending on whether that value is NA/null.

np.mean computes the average of those values, treating True as 1 and False as 0, which is equivalent to computing the proportion of null values in the column.

np.mean([True, False, False, False])

# equivalent to 
np.mean([1, 0, 0, 0])

# 0.25
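
As a minimal sketch of the same idea on a pandas column (the Series `s` and its values are made up for illustration, not taken from the question's dataframe):

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value out of four
s = pd.Series([1.0, np.nan, 3.0, 4.0])

mask = s.isnull()            # boolean Series: False, True, False, False
as_ints = mask.astype(int)   # the same mask viewed as 0, 1, 0, 0

# Averaging the booleans counts True as 1 and False as 0
proportion_missing = np.mean(mask)

print(as_ints.tolist())      # [0, 1, 0, 0]
print(proportion_missing)    # 0.25
```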

CodePudding user response：

It's something that's very intuitive once you're used to it. The steps leading to this kind of code could be like the following:

To get the percentage of null values, we need to count all null rows, and divide the count by the total number of rows.
So, first we need to detect the null rows. This is easy, as there is a provided method: df[col].isnull().
The result of df[col].isnull() is a new column consisting of booleans -- True or False.
Now we need to count the Trues. Here we can realize that counting Trues in a boolean array is the same as summing the array: True can be converted to 1, and False to zero.
So we would be left with df[col].isnull().sum() / len(df[col]).
But summing and dividing by the length is just the arithmetic mean! Therefore, we can shorten this to arrive at the final result: np.mean(df[col].isnull()).
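
The steps above can be checked side by side with a rough sketch like this one. The dataframe and its column names are hypothetical, chosen only so the two computations can be compared:

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "a" has 1 null in 4 rows, column "b" has 2
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, "x", np.nan, "y"],
})

results = {}
for col in df.columns:
    null_mask = df[col].isnull()                # detect the null rows
    by_count = null_mask.sum() / len(df[col])   # count the Trues, divide by row count
    by_mean = np.mean(null_mask)                # the same quantity written as a mean
    results[col] = (float(by_count), float(by_mean))

print(results)
```

Both values in each pair should agree, since the mean of a boolean column is its sum divided by its length.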

CodePudding user response：

The first thing that happens is the call

df[col].isnull()

This creates a vector of boolean values, with True wherever the column's value is null. For example, if the values are [x1, x2, x3, null, x4], it gives the vector [False, False, False, True, False].

The next step is the np.mean function. It calculates the mean of the vector, treating True as 1 and False as 0, so the vector is effectively [0, 0, 0, 1, 0].

The mean of this vector is equal to the number of nulls divided by the length of the vector, which is the method you are using.

One more comment: it does not give a percentage. For that, you need to multiply by 100.
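
A possible way to apply that correction to the loop from the question is sketched below. The sample dataframe is an assumption for illustration, and the last lines show a loop-free variant using `DataFrame.isnull().mean()`, which is not part of the original code:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a" has 1 null in 4 rows, "b" has none
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": [1, 2, 3, 4]})

percentages = {}
for col in df.columns:
    # Multiply the proportion by 100 so the printed '%' is accurate
    pct_missing = np.mean(df[col].isnull()) * 100
    percentages[col] = float(pct_missing)
    print('{} - {}%'.format(col, round(pct_missing, 2)))

# Loop-free alternative: column-wise mean of the boolean mask
pct_per_column = (df.isnull().mean() * 100).round(2)
print(pct_per_column)
```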