Running a function in sapply is returning faulty results when first ID is all missing-CodePudding

I'm having a very odd, specific problem that I'm struggling to google, so I'm hoping I can just show someone.

I've written a function that will fill in some missing data according to a few conditions. For example, for panel data like this:

library(tidyverse)
library(data.table)

dt <- data.frame(id =  c(rep('a', 5),
                         rep('b', 5),
                         rep('c', 5)),
                 var1 = c(rep('', 4), 'bonjour',
                          'bye', NA, 'bye', 'bye', NA,
                          'hi', 'hi', NA, 'hi', 'hi'),
                 year = c(2005:2009,
                          1995:1998, 2002,
                          1995:1999))
dt
    id    var1 year
 1:  a         2005
 2:  a         2006
 3:  a         2007
 4:  a         2008
 5:  a bonjour 2009
 6:  b     bye 1995
 7:  b    <NA> 1996
 8:  b     bye 1997
 9:  b     bye 1998
10:  b    <NA> 2002
11:  c      hi 1995
12:  c      hi 1996
13:  c    <NA> 1997
14:  c      hi 1998
15:  c      hi 1999

I use the following function to update some of the missing values:

fill.in <- function(var, yr, finyr) {
  leadv <- lead(var, n=1, order_by = yr)
  lagv <- lag(var, n=1, order_by = yr)
  
  leadyr <- lead(yr, n=1, order_by = yr)
  lagyr <- lag(yr, n=1, order_by = yr)
  
  # ------- build the updated var w/ sequential conditions
  # keep the var as it is if not missing
  try1 <- ifelse(test = !is.na(var),
                 yes = var,
                 no = NA)
  
  # fill in if the lead and lag match and no more than 2 missing years
  try2 <- ifelse(test = is.na(try1) & leadv == lagv &
                        abs(leadyr-lagyr) <= 3 &
                        !is.na(leadv),
                 yes = leadv,
                 no = try1)
  
  # fill in with the lag if it's the final year of observed data
  ifelse(test = is.na(try2) & yr == finyr &
                abs(yr-lagyr) <= 3 & !is.na(lagv),
         yes = lagv,
         no = try2)
}

After a little bit of set-up, by and large I get good results:

# ------------ Set-up
# real data is big so use data.table
setDT(dt)

dt[, finalyr := max(year), by = id]

# don't want to fill in factor values
dt$var1 <- as.character(dt$var1)

# make empty strings NAs
dt[, var1 := na_if(var1, '')]

# useful for when i'm filling in many variables
fill.in.vs <- c('var1')
fixed.vnames <- paste0('fixed.', fill.in.vs)

# ------------ Call the function and results
dt[, (fixed.vnames) := sapply(.SD,
                              FUN = fill.in,
                              year,
                              finalyr,
                              simplify = FALSE, USE.NAMES = FALSE),
   by = id, .SDcols = fill.in.vs]

# this gives me what I want:
dt
    id    var1 year finalyr fixed.var1
 1:  a    <NA> 2005    2009       <NA>
 2:  a    <NA> 2006    2009       <NA>
 3:  a    <NA> 2007    2009       <NA>
 4:  a    <NA> 2008    2009       <NA>
 5:  a bonjour 2009    2009    bonjour
 6:  b     bye 1995    2002        bye
 7:  b    <NA> 1996    2002        bye
 8:  b     bye 1997    2002        bye
 9:  b     bye 1998    2002        bye
10:  b    <NA> 2002    2002       <NA>
11:  c      hi 1995    1999         hi
12:  c      hi 1996    1999         hi
13:  c    <NA> 1997    1999         hi
14:  c      hi 1998    1999         hi
15:  c      hi 1999    1999         hi

The problem is that when the first set of IDs--e.g. all the 'a' values--have empty strings that I turn into NAs, all values of the "fixed" variable end up NAs as well.

So using that same code but with the following data, I get all NAs in the new variable:

# id of 'a' now is all empty strings in var1:
dt <- data.frame(id =  c(rep('a', 5),
                         rep('b', 5),
                         rep('c', 5)),
                 var1 = c(rep('', 5),
                          'bye', NA, 'bye', 'bye', NA,
                          'hi', 'hi', NA, 'hi', 'hi'),
                 year = c(2005:2009,
                          1995:1998, 2002,
                          1995:1999))

# which results in this final data after running the same code above:
dt
    id var1 year finalyr fixed.var1
 1:  a <NA> 2005    2009         NA
 2:  a <NA> 2006    2009         NA
 3:  a <NA> 2007    2009         NA
 4:  a <NA> 2008    2009         NA
 5:  a <NA> 2009    2009         NA
 6:  b  bye 1995    2002         NA
 7:  b <NA> 1996    2002         NA
 8:  b  bye 1997    2002         NA
 9:  b  bye 1998    2002         NA
10:  b <NA> 2002    2002         NA
11:  c   hi 1995    1999         NA
12:  c   hi 1996    1999         NA
13:  c <NA> 1997    1999         NA
14:  c   hi 1998    1999         NA
15:  c   hi 1999    1999         NA

For brevity I won't show you all the things I've tried, but a few observations about when it happens:

All empty strings in the first ID isn't a problem if I do not convert empty strings to NA.
I only get this result if it's the first ID that has all empty strings; if it's the second set, the results are fine.
How I convert "" to NA doesn't matter, i.e. it's not an issue with na_if because it also happens when I use ifelse.

Overall I'm pretty stumped as to what's happening or how to investigate it further. I would really appreciate any help.

CodePudding user response：

When I run your code, I get this warning:

1: In [.data.table(dt, , :=((fixed.vnames), sapply(.SD, FUN = fill.in, : Coercing 'character' RHS to 'logical' to match the type of the target column (column 0 named '').

I get it twice, once for the second, and for the third group. As it says, the variable fixed.var1 is initialised as logical variable (for the group id==a); values that are added later are then converted to the same class 'logical'.

The major culprit here is your function fill.in(), since e.g.

logicalVar <- fill.in( var=rep(NA,5), yr=2005:2009, finyr=rep(2009,5)); class(logicalVar)

returns a logical variable. So all you need to do is to wrap as.character() around the return of your function.