Home > Software engineering >  how to count missing values that are not the regular missing in pandas
how to count missing values that are not the regular missing in pandas

Time:11-04

I have a Dataframe that have some null values, but also other entries that I should count as missing. The forms of missing that I want to take into account are:

  • The normal null value from pandas
  • The string N/A
  • 0.0
  • "-"

I want to identify the percentage of missing values per column.

I tried this

   # Total null values
    mis_val = df.isnull().sum()
    
    # N/A values
    mis_val = mis_val (df=='N/A').sum()
    
    # Percentage of total data
    mis_val_percent = 100 * mis_val / len(df)

But the second line of code doesn't seem to do what I expected. I wanted it to count the number of 'N/A' per column

CodePudding user response:

If the missing forms are always the same, here is another way of doing it:

import pandas as pd
import numpy as np

data = {'col1':[10.0,20.0,np.nan,'N/A',0,25],
        'col2':[0,np.nan,'N/A','N/A','','-']}
df = pd.DataFrame(data)

# The 4 "forms of missing": 
missing_1 = (df=='N/A').sum()
missing_2 = df.isna().sum()
missing_3 = df.isnull().sum()
missing_4 = (df=='-').sum()

mis_val_percent =100*(missing_1 missing_2 missing_3 missing_4)/len(df)
print(mis_val_percent)

outputs:

col1    50.000000
col2    83.333333

CodePudding user response:

here is one way about it

# define regex pattern on values that you like treated as null
# remember to escape the regex character

# defined N/A, 0.0, and - : /, . and - are all escaped with \
# each patter is separated with |
pat = 'N\/A|0\.0|\-'

# replace values defined in pat with np.nan
# check if its null and take the sum

df['col'].replace(pat, np.nan, regex=True).isna().sum()

CodePudding user response:

Using isin([])

mis_val = df.Data.isin([pd.NA, np.nan, None, '-', 'N/A', 0]).sum()

Example:

df = pd.DataFrame({'Data': [1, 2, 3, 'text', pd.NA, 5, 0, 4, '-', 12, 'N/A', 20, np.nan]})
mis_val = df.Data.isin([pd.NA, np.nan, None, '-', 'N/A', 0]).sum()
mis_val_percent = 100 * mis_val / df.size
print(f'Missed values rate is {mis_val_percent:.02f}%')
Missed values rate is 38.46%

Side note. About using pd.NA vs np.nan for pandas

CodePudding user response:

If you want count N/A in particular column:

df["Col_name"].isna().sum()

If you want count N/A in full data frame:

df.isna().sum().sum()
  • Related