In my dataframe, a value should be classed as missing if it is "nan", 0, or "missing".
I wrote a list:
null_values = ["nan", 0, "missing"]
Then I checked which columns contained missing values:
df.columns[df.isin(null_values).any()]
My df looks as follows:
{'Sick': {0: False, 1: False, 2: False, 3: False, 4: False, 5: False, 6: True},
 'Name': {0: 'Liam', 1: 'Chloe', 2: 'Andy', 3: 'Max', 4: 'Anne', 5: nan, 6: 'Charlie'},
 'Age': {0: 1.0, 1: 2.0, 2: 8.0, 3: 4.0, 4: 5.0, 5: 2.0, 6: 9.0}}
It flags the column 'Sick' as containing missing values even though it only contains False/True values. However, it correctly recognises that 'Age' has no missing values. Why does it count False as a missing value when I have not defined it as such in my list?
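For a runnable reproduction (a minimal sketch rebuilding the data from the dict above, assuming the usual pandas/numpy imports):

import numpy as np
import pandas as pd

# Rebuild the DataFrame shown above
df = pd.DataFrame({'Sick': [False, False, False, False, False, False, True],
                   'Name': ['Liam', 'Chloe', 'Andy', 'Max', 'Anne', np.nan, 'Charlie'],
                   'Age': [1.0, 2.0, 8.0, 4.0, 5.0, 2.0, 9.0]})

null_values = ["nan", 0, "missing"]
print(df.columns[df.isin(null_values).any()])
# Index(['Sick'], dtype='object')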
CodePudding user response:
The reason for the behaviour: isin matches by equality, and in Python bool is a subclass of int, so False == 0 and the 0 in your list matches every False in 'Sick'. One idea is to exclude boolean columns with DataFrame.select_dtypes:
null_values = ["nan", 0, "missing"]
df1 = df.select_dtypes(exclude=bool)
print(df1.columns[df1.isin(null_values).any()])
Index([], dtype='object')
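To see the equality that causes the match, a quick illustrative check:

# bool is a subclass of int, so False compares equal to 0
print(False == 0)
# True

# isin matches by equality, so 0 in the list catches every False
print(pd.Series([False, True]).isin([0]))
# 0     True
# 1    False
# dtype: bool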
If you also need to match NaN missing values, add np.nan to the list:
null_values = ["nan", 0, "missing", np.nan]
df1 = df.select_dtypes(exclude=bool)
print(df1.columns[df1.isin(null_values).any()])
Index(['Name'], dtype='object')
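Side note: although np.nan == np.nan is False under ordinary comparison, isin treats NaN specially and does match it, which is why 'Name' is flagged here. A small illustrative check:

print(np.nan == np.nan)
# False

print(pd.Series([np.nan, 'Liam']).isin([np.nan]))
# 0     True
# 1    False
# dtype: bool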
EDIT: Another trick is to convert all values to strings; you then need to compare against '0' for strings made from integers and '0.0' for strings made from floats:
null_values = ["nan", '0', '0.0', "missing"]
print(df.columns[df.astype(str).isin(null_values).any()])
Index(['Name'], dtype='object')
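This works because astype(str) turns NaN into the string 'nan' (so "nan" in the list now matches it) and booleans into 'True'/'False' (so they no longer collide with '0'), while integer and float zeros stringify differently. A small illustrative check:

print(pd.Series([np.nan, False, 0, 0.0]).astype(str).tolist())
# ['nan', 'False', '0', '0.0']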