I have a Dataframe that have some null values, but also other entries that I should count as missing. The forms of missing that I want to take into account are:
- The normal null value from pandas
- The string N/A
- 0.0
- "-"
I want to identify the percentage of missing values per column.
I tried this
# Total null values
mis_val = df.isnull().sum()
# N/A values
mis_val = mis_val (df=='N/A').sum()
# Percentage of total data
mis_val_percent = 100 * mis_val / len(df)
But the second line of code doesn't seem to do what I expected. I wanted it to count the number of 'N/A' per column
CodePudding user response:
If the missing forms are always the same, here is another way of doing it:
import pandas as pd
import numpy as np
data = {'col1':[10.0,20.0,np.nan,'N/A',0,25],
'col2':[0,np.nan,'N/A','N/A','','-']}
df = pd.DataFrame(data)
# The 4 "forms of missing":
missing_1 = (df=='N/A').sum()
missing_2 = df.isna().sum()
missing_3 = df.isnull().sum()
missing_4 = (df=='-').sum()
mis_val_percent =100*(missing_1 missing_2 missing_3 missing_4)/len(df)
print(mis_val_percent)
outputs:
col1 50.000000
col2 83.333333
CodePudding user response:
here is one way about it
# define regex pattern on values that you like treated as null
# remember to escape the regex character
# defined N/A, 0.0, and - : /, . and - are all escaped with \
# each patter is separated with |
pat = 'N\/A|0\.0|\-'
# replace values defined in pat with np.nan
# check if its null and take the sum
df['col'].replace(pat, np.nan, regex=True).isna().sum()
CodePudding user response:
Using isin([])
mis_val = df.Data.isin([pd.NA, np.nan, None, '-', 'N/A', 0]).sum()
Example:
df = pd.DataFrame({'Data': [1, 2, 3, 'text', pd.NA, 5, 0, 4, '-', 12, 'N/A', 20, np.nan]})
mis_val = df.Data.isin([pd.NA, np.nan, None, '-', 'N/A', 0]).sum()
mis_val_percent = 100 * mis_val / df.size
print(f'Missed values rate is {mis_val_percent:.02f}%')
Missed values rate is 38.46%
Side note. About using pd.NA vs np.nan for pandas
CodePudding user response:
If you want count N/A in particular column:
df["Col_name"].isna().sum()
If you want count N/A in full data frame:
df.isna().sum().sum()