Example Data
ID | Name | Phone |
---|---|---|
1 | x | 212 |
2 | y | NaN |
3 | xy | NaN |
df is the name of the dataset. The code below gives the names of the columns with no missing values.
no_nulls = set(df.columns[df.isnull().mean()==0])
isnull() will convert the dataset into something like this:
ID | Name | Phone |
---|---|---|
False | False | False |
False | False | True |
False | False | True |
Can someone explain how mean works on non-integers?
I used the following instead and it worked, but I am curious about mean:
no_nulls = set(df.columns[df.notnull().all()])
CodePudding user response:
In your case, .mean() is processing a dataframe that contains only the boolean values True and False. Here, .mean() treats False as 0 and True as 1. Hence, if you look at the result of df.isnull().mean(), you will see:
df.isnull().mean()
ID 0.000000
Name 0.000000
Phone 0.666667
dtype: float64
Here, since columns ID and Name contain only False values, .mean() treats them all as zeros and returns a mean of zero. Column Phone has one False and two True values, so its mean is equivalent to the mean of 0, 1, 1, i.e. 2/3 ≈ 0.666667.
As a result, when you check df.isnull().mean()==0, only the first 2 columns will be True, and hence you get {'ID', 'Name'} as the result of no_nulls.
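To see this end to end, here is a minimal sketch that rebuilds the example table from the question and compares both approaches. The variable names mask, null_share, no_nulls_mean and no_nulls_all are only illustrative, and the Phone column is assumed to be numeric with NaN for the missing entries.

```python
import numpy as np
import pandas as pd

# Rebuild the example data from the question (Phone assumed numeric, NaN = missing).
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["x", "y", "xy"],
    "Phone": [212, np.nan, np.nan],
})

# Boolean frame: True where a value is missing.
mask = df.isnull()

# Mean of booleans: False counts as 0 and True as 1,
# so this is the fraction of missing values per column.
null_share = mask.mean()
print(null_share)

# Approach from the question: keep columns whose missing fraction is exactly 0.
no_nulls_mean = set(df.columns[null_share == 0])

# Alternative from the question: keep columns where every value is non-null.
no_nulls_all = set(df.columns[df.notnull().all()])

print(no_nulls_mean)
print(no_nulls_all)
```

Both sets contain the same column names. The .all() version states the intent more directly, while the .mean() version also tells you how much of each column is missing.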
The official documentation for DataFrame.mean gives a further hint: look at the numeric_only parameter and its default behavior, quoted here:
Parameters
numeric_only bool, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.