Home > Software design >  How to compare to NA when iterating over multiple variables
How to compare to NA when iterating over multiple variables

Time:09-21

I'm going to give a lot of background (on which I welcome suggestions to improve) that isn't strictly necessary, so you can scroll to the end if you just want to see my main question.

Background

I have a dataset where I'd like to fill in missing values for several variables according to a few rules. Some example data:

library(data.table)
library(tidyverse)

dt <- data.frame(id = c('a', 'a', 'a', 'a', 'a',
                        'b', 'b', 'b', 'b', 'b'),
                 var1 = c(1, 1, NA, 1, NA,
                          NA, 5, 5, NA, 5),
                 var2 = c('hi', 'hi', NA, 'hi', 'hi',
                          'bye', NA, 'bye', 'bye', NA),
                 year = c(2005:2009,
                          1995:1998, 2002))

Note that the real data is big: 10s of millions of observations, at least 30-40 variables, and I'll be filling in around 10 columns or so, hence the data.table approach.

I currently fill them in using a function that finds leads and lags and checks my conditions, like so:

fill.in <- function(var, yr, finyr) {
  leadv <- lead(var, order_by = yr)
  lagv <- lag(var, order_by = yr)
  
  leadyr <- lead(yr, order_by = yr)
  lagyr <- lag(yr, order_by = yr)
  
  # ------- build the updated var w/ sequential conditions
  # keep the var as it is if not missing
  try1 <- ifelse(test = !is.na(var),
                 yes = var,
                 no = NA)
  
  # fill in if the lead and lag match and no more than 2 missing years
  try2 <- ifelse(test = is.na(try1) & leadv == lagv &
                        abs(leadyr-lagyr) <= 3 &
                        !is.na(leadv),
                 yes = leadv,
                 no = try1)
  
  # fill in with the lag if it's the final year of observed data
  ifelse(test = is.na(try2) & yr == finyr &
                abs(yr-lagyr) <= 3 & !is.na(lagv),
         yes = lagv,
         no = try2)
}

I call this using data.table's .SD:

setDT(dt)

dt[, finalyr := max(year), by = id]

# variables to iterate over and the new column names
fill.in.vs <- c('var1', 'var2')
fixed.vnames <- paste0('fixed.', fill.in.vs)

# I don't want factor values, so make strings characters
dt$var2 <- as.character(dt$var2)

dt[, (fixed.vnames) := sapply(.SD,
                              FUN = fill.in,
                              yr = year,
                              finyr = finalyr,
                              simplify = FALSE, USE.NAMES = FALSE),
   by = id, .SDcols = fill.in.vs]

This gives me the data I want:

    id var1 var2 year finalyr fixed.var1 fixed.var2
 1:  a    1   hi 2005    2009          1         hi
 2:  a    1   hi 2006    2009          1         hi
 3:  a   NA <NA> 2007    2009          1         hi
 4:  a    1   hi 2008    2009          1         hi
 5:  a   NA   hi 2009    2009          1         hi
 6:  b   NA  bye 1995    2002         NA        bye
 7:  b    5 <NA> 1996    2002          5        bye
 8:  b    5  bye 1997    2002          5        bye
 9:  b   NA  bye 1998    2002         NA        bye
10:  b    5 <NA> 2002    2002          5       <NA>

Main question

Ok now to my actual question: I want an indicator of rows that have been updated. I don't think the way I'm updating, with data.table and .SD, allows for making paired variables (by which I mean producing two or three different variables In each iteration), so I thought I would just flag the observations where essentially original.var != fixed.var.

This runs into issues because R doesn't like comparing values to NAs. (See here for example.)

I've tried to implement some solutions for this but I keep running into issues with trying to do this over a list of variables. For example, the following works to identify which are the same or not:

# this works:
dt$updated.var1 <- mapply(identical, dt$var1, dt$fixed.var1,
                          SIMPLIFY = FALSE, USE.NAMES = FALSE)

dt[,c('var1', 'fixed.var1', 'updated.var1')]
    var1 fixed.var1 updated.var1
 1:    1          1         TRUE
 2:    1          1         TRUE
 3:   NA          1        FALSE
 4:    1          1         TRUE
 5:   NA          1        FALSE
 6:   NA         NA         TRUE
 7:    5          5         TRUE
 8:    5          5         TRUE
 9:   NA         NA         TRUE
10:    5          5         TRUE

But when I try to use it with data.table and lists of variables it returns all FALSE for both variables:

# indicate updated observations
update.names <- paste0('updated.', fill.in.vs)

# this doesn't work how I want it to:
dt[, (update.names) := mapply(identical,
                              .SD,
                              fixed.vnames,
                              SIMPLIFY = FALSE, USE.NAMES = FALSE),
   .SDcols = fill.in.vs]

dt[,c('var1', 'fixed.var1', 'updated.var1')]
    var1 fixed.var1 updated.var1
 1:    1          1        FALSE
 2:    1          1        FALSE
 3:   NA          1        FALSE
 4:    1          1        FALSE
 5:   NA          1        FALSE
 6:   NA         NA        FALSE
 7:    5          5        FALSE
 8:    5          5        FALSE
 9:   NA         NA        FALSE
10:    5          5        FALSE

I'm not certain what the issue is, but I'm guessing it has to do with my attempt to get mapply to recognize the list of the "fixed" variables as variables to iterate over.

Any suggestions would be greatly appreciated.

CodePudding user response:

You can write a custom function to handle this and use Map to apply the function for each pair.

library(data.table)

new_cols <- function(x, y) {
  z <- x != y | is.na(x) & is.na(y)
  z[is.na(z)] <- TRUE
  z
}

update.names <- paste0('updated.', fill.in.vs)

dt[, (update.names) := Map(new_cols, dt[, ..fixed.vnames], dt[, ..fill.in.vs])]
dt

#    id var1 var2 year finalyr fixed.var1 fixed.var2 updated.var1 updated.var2
# 1:  a    1   hi 2005    2009          1         hi        FALSE        FALSE
# 2:  a    1   hi 2006    2009          1         hi        FALSE        FALSE
# 3:  a   NA <NA> 2007    2009          1         hi         TRUE         TRUE
# 4:  a    1   hi 2008    2009          1         hi        FALSE        FALSE
# 5:  a   NA   hi 2009    2009          1         hi         TRUE        FALSE
# 6:  b   NA  bye 1995    2002         NA        bye         TRUE        FALSE
# 7:  b    5 <NA> 1996    2002          5        bye        FALSE         TRUE
# 8:  b    5  bye 1997    2002          5        bye        FALSE        FALSE
# 9:  b   NA  bye 1998    2002         NA        bye         TRUE        FALSE
#10:  b    5 <NA> 2002    2002          5       <NA>        FALSE         TRUE

You might get warnings in the console, please ignore them for now. There is an open issue regarding it. https://github.com/Rdatatable/data.table/issues/2988

  • Related