How to scan for and flag miscoded missing values in R (and other languages)?

Time:11-18

When prepping data for analysis, one of the first things to do is look for missing values. It's nice if they are already in a format the language recognizes as missing (e.g. NA in R), but sometimes you get values like these, which are quite possibly just miscoded missing-value indicators:

  1. "n/a"
  2. "not available"
  3. "incomplete"
  4. "unknown"
  5. "null"
  6. "nil"
  7. "not provided"
  8. " " (blank spaces)
  9. 99999999
  10. 9999-01-01
  11. "000-000-0000"
  12. "*"
  13. "-"
  14. [etc.]

My question: are there any packages that can scan the values in a data frame and flag the ones that may be miscoded missing values, such as the ones above?

The only package I know of right now that does this is dataReporter in R, via its identifyMissing function, but of the values above it only detects " ", "-", 99999999, and 9999-01-01 as potentially missing. I'm hoping for something more comprehensive, even if that results in some additional false positives.

I primarily work in R but would be happy to have resources for other languages as well.

CodePudding user response:

Using tidyverse, the easiest approach would probably be to make a vector with all the possible missing-value alternatives, then use this vector to filter (or to recode, or to mutate with an if_else statement).

This would allow you to make your own list of values that potentially represent miscoded values. As you point out, there can be a multitude of values that represent miscoded missing values, and maybe the easiest is to use a custom vector?

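library(tidyverse)

df <- tibble(var_1 = c("missing", "N/A", "c", "c", "null"))

# custom list of strings to treat as miscoded missing values
missing_synonyms <- c("missing", "N/A", "null")

# keep only the rows whose value is in the list
df %>% filter(var_1 %in% missing_synonyms)

# or keep every row and add a flag column
df %>% mutate(flag = if_else(var_1 %in% missing_synonyms, "missing", "not missing"))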


# Gives the following result:

# A tibble: 5 x 2
  var_1   flag       
  <chr>   <chr>      
1 missing missing    
2 N/A     missing    
3 c       not missing
4 c       not missing
5 null    missing  
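
The same custom-list idea can be applied to every column at once. The following is only a minimal sketch in base R (so it runs without extra packages), not a replacement for a dedicated package. The helper name flag_miscoded and the example columns are hypothetical. It assumes that matching should ignore case and surrounding whitespace, and that cells which are empty after trimming count as suspicious.

```r
# Hypothetical helper: flag cells whose normalised text is in a custom list.
flag_miscoded <- function(df, synonyms) {
  synonyms <- tolower(trimws(synonyms))
  flags <- lapply(df, function(col) {
    # Compare everything as trimmed, lower-case text.
    value <- tolower(trimws(as.character(col)))
    # Empty after trimming covers "" and whitespace-only cells;
    # genuine NA values are left unflagged.
    !is.na(value) & (value == "" | value %in% synonyms)
  })
  as.data.frame(flags)
}

# Illustrative data: a text column and an integer column with a sentinel.
df <- data.frame(
  var_1 = c("missing", "N/A", "c", " ", "null"),
  var_2 = c(1L, 99999999L, 3L, 4L, 5L),
  stringsAsFactors = FALSE
)

missing_synonyms <- c("missing", "n/a", "null", "99999999")

flags <- flag_miscoded(df, missing_synonyms)
colSums(flags)  # number of suspicious cells per column
```

Because numeric sentinels are compared as text here, they have to be listed as strings, and a large double such as 1e8 may print in scientific notation and so not match; that is a limitation of this sketch.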

CodePudding user response:

If I have understood you correctly and you already know which values should be flagged as missing, then here's a solution.

library(data.table)

# dummy table
df <- data.table( col = c('n/a', 'a', 'not available')); df

             col
1:           n/a
2:             a
3: not available



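# a vector of target values, collapsed into one regex alternation
x <- c('n/a', 'not available')
x <- paste0(x, collapse = '|')

# create flagging col (regex match, so any cell containing a target is flagged)
df[, flag_missing := grepl(x, col, perl = TRUE)][]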

             col flag_missing
1:           n/a         TRUE
2:             a        FALSE
3: not available         TRUE
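
One caveat with a plain alternation pattern: targets such as "*" or "-" from the question are regex metacharacters or would match inside ordinary values, and a partial match like "n/a pending review" would also be flagged. Below is a possible sketch of a stricter variant, shown on a plain character vector in base R. The names targets and values are made up for illustration. It assumes whole-cell matching that ignores case and surrounding spaces is what is wanted.

```r
# Candidate sentinel strings; "*" is a regex metacharacter and "-"
# would otherwise match inside ordinary hyphenated values.
targets <- c("n/a", "not available", "*", "-")

# \Q...\E (PCRE syntax, hence perl = TRUE) quotes each target literally;
# ^ and $ force a whole-cell match, and \s* tolerates stray padding.
pattern <- paste0(
  "^\\s*(",
  paste0("\\Q", targets, "\\E", collapse = "|"),
  ")\\s*$"
)

values <- c("n/a", "a", "Not Available", "*", "a-b", " - ", "n/a pending review")
flag_missing <- grepl(pattern, values, perl = TRUE, ignore.case = TRUE)

data.frame(values, flag_missing)
```

The same pattern should drop into the data.table call above in place of x, since grepl is used the same way there.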