What will be the mean of a column if there are missing values in the column?

Time:09-28

Example Data

ID Name Phone
1 x 212
2 y NaN
3 xy NaN

df is the name of the dataset. The code below gives the names of the columns with no missing values.

no_nulls = set(df.columns[df.isnull().mean()==0])

isnull() converts the dataset into something like this:

ID Name Phone
False False False
False False True
False False True

Can someone explain how mean works on non-integer values?

I used the following instead and it worked, but I am curious about mean:

no_nulls = set(df.columns[df.notnull().all()]) 

CodePudding user response:

In your case, .mean() is processing a dataframe that contains only the boolean values True and False. In this case, .mean() treats False as 0 and True as 1. Hence, if you look at the result of df.isnull().mean(), you will see:

df.isnull().mean()

ID       0.000000
Name     0.000000
Phone    0.666667
dtype: float64

Here, columns ID and Name contain only False values, so .mean() treats them all as zeros and returns a mean of 0. Column Phone has one False and two True values, so its mean is the mean of 0, 1, 1, i.e. 2/3 ≈ 0.666667. In other words, the mean of isnull() is the fraction of missing values in each column.

As a result, df.isnull().mean()==0 is True only for the first two columns, and hence you get {'ID', 'Name'} as the result of no_nulls.
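To check this yourself, here is a minimal sketch that rebuilds the example data from the question and compares both approaches. The variable names null_fraction and no_nulls_alt are made up for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd

# Rebuild the example data from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["x", "y", "xy"],
    "Phone": [212, np.nan, np.nan],
})

# isnull() gives booleans; mean() counts True as 1 and False as 0,
# so this is the fraction of missing values per column.
null_fraction = df.isnull().mean()

# Columns whose missing fraction is exactly zero.
no_nulls = set(df.columns[null_fraction == 0])

# The alternative from the question: every value in the column is non-null.
no_nulls_alt = set(df.columns[df.notnull().all()])

print(null_fraction)
print(no_nulls, no_nulls_alt)
```

Both sets contain only ID and Name. The notnull().all() version states the intent more directly, while the mean() version is handy if you later want a threshold, such as columns with less than half of their values missing.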

The official documentation of DataFrame.mean gives a hint through the numeric_only= parameter. Note its behavior with the default setting (quoted from the documentation at the time of writing; in pandas 2.0 and later the default is False):

Parameters

numeric_only bool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.
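As a rough illustration of the "float, int, boolean" wording, the sketch below passes numeric_only=True explicitly, so it does not depend on the default of your pandas version. The frame mixed and its column names are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical frame with one boolean column and one string column.
mixed = pd.DataFrame({
    "flag": [True, False, True, True],
    "label": ["a", "b", "c", "d"],
})

# Boolean columns count as numeric, so "flag" is averaged (True=1, False=0);
# the string column "label" is left out of the result.
flag_mean = mixed.mean(numeric_only=True)

print(flag_mean)
```

Here flag has three True values out of four, so its mean is 0.75, and label does not appear in the output.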
