While building a data-cleaning project in Python, I found this code:
# let's see if there is any missing data
for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing,2)))
Which actually works fine, giving back the % of null values per column in the dataframe, but I'm a little confused on how it works:
First we define a loop over each column in the dataframe, then we compute that mean, but the mean of what exactly? Is it the mean, for each column, of the number of null cells, or something else?
Just for reference, I've worked around it with this:
NullValues=df.isnull().sum()/len(df)
print('{} - {}%'.format(col, round(NullValues,2)))
That gives me basically the same results, but I'd like to understand the mechanism: I'm confused about the first block of code.
CodePudding user response:
df[col].isnull() assigns a boolean (True/False) to each value depending on its NA/null state.
np.mean computes the average of those values, with True as 1 and False as 0, which is equivalent to computing the proportion of null values in the column.
np.mean([True, False, False, False])
# equivalent to
np.mean([1, 0, 0, 0])
# 0.25
CodePudding user response:
It's something that's very intuitive once you're used to it. The steps leading to this kind of code could be like the following:
- To get the percentage of null values, we need to count all null rows, and divide the count by the total number of rows.
- So, first we need to detect the null rows. This is easy, as there is a provided method: df[col].isnull().
- The result of df[col].isnull() is a new column consisting of booleans, True or False.
- Now we need to count the Trues. Here we can realize that counting Trues in a boolean array is the same as summing the array: True can be converted to 1, and False to zero.
- So we would be left with df[col].isnull().sum() / len(df[col]).
- But summing and dividing by the length is just the arithmetic mean! Therefore, we can shorten this to arrive at the final result: mean(df[col].isnull()).
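The steps above can be sketched on a small made-up DataFrame. The column names and values here are assumptions for illustration only, not data from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not from the original question.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", None, None, "w"],
})

results = {}
for col in df.columns:
    mask = df[col].isnull()              # boolean Series: True where the value is missing
    by_sum = mask.sum() / len(df[col])   # number of Trues divided by the number of rows
    by_mean = np.mean(mask)              # same quantity: True counts as 1, False as 0
    results[col] = (by_sum, by_mean)
    print(col, by_sum, by_mean)
```

Both columns of the printed output should agree for every column, which is the whole point of the shortcut.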
CodePudding user response:
The first thing that happens is df[col].isnull(). This creates a vector of boolean values that is True wherever the column's value is null. For example, if the values are [x1, x2, x3, null, x4], it gives the vector [False, False, False, True, False].
The next step is the np.mean function. It calculates the mean of the vector, treating True as 1 and False as 0, so it effectively averages the vector [0, 0, 0, 1, 0].
The mean of this vector is equal to the sum of nulls divided by the length of the vector which is the method you are using.
Just a comment: it does not give a percent; you need to multiply by 100.
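As a rough sketch of that last point, the fraction can be scaled to a real percentage before printing. The DataFrame below is a made-up example (its columns and values are assumptions), and the loop is replaced by a column-wise mean over the whole frame:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", None, None, "w"],
})

# isnull() gives a boolean frame; mean() averages each column (True as 1, False as 0).
# Multiplying by 100 turns the fraction into a percentage.
pct_missing = df.isnull().mean() * 100

for col, pct in pct_missing.items():
    print('{} - {}%'.format(col, round(pct, 2)))
```

With this sample data, column a has one missing value out of four and column b has two out of four.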