I'm having a very odd, specific problem that I'm struggling to google, so I'm hoping I can just show someone.
I've written a function that will fill in some missing data according to a few conditions. For example, for panel data like this:
library(tidyverse)
library(data.table)
dt <- data.frame(id = c(rep('a', 5),
rep('b', 5),
rep('c', 5)),
var1 = c(rep('', 4), 'bonjour',
'bye', NA, 'bye', 'bye', NA,
'hi', 'hi', NA, 'hi', 'hi'),
year = c(2005:2009,
1995:1998, 2002,
1995:1999))
dt
id var1 year
1: a 2005
2: a 2006
3: a 2007
4: a 2008
5: a bonjour 2009
6: b bye 1995
7: b <NA> 1996
8: b bye 1997
9: b bye 1998
10: b <NA> 2002
11: c hi 1995
12: c hi 1996
13: c <NA> 1997
14: c hi 1998
15: c hi 1999
I use the following function to update some of the missing values:
fill.in <- function(var, yr, finyr) {
leadv <- lead(var, n=1, order_by = yr)
lagv <- lag(var, n=1, order_by = yr)
leadyr <- lead(yr, n=1, order_by = yr)
lagyr <- lag(yr, n=1, order_by = yr)
# ------- build the updated var w/ sequential conditions
# keep the var as it is if not missing
try1 <- ifelse(test = !is.na(var),
yes = var,
no = NA)
# fill in if the lead and lag match and no more than 2 missing years
try2 <- ifelse(test = is.na(try1) & leadv == lagv &
abs(leadyr-lagyr) <= 3 &
!is.na(leadv),
yes = leadv,
no = try1)
# fill in with the lag if it's the final year of observed data
ifelse(test = is.na(try2) & yr == finyr &
abs(yr-lagyr) <= 3 & !is.na(lagv),
yes = lagv,
no = try2)
}
After a little bit of set-up, by and large I get good results:
# ------------ Set-up
# real data is big so use data.table
setDT(dt)
dt[, finalyr := max(year), by = id]
# don't want to fill in factor values
dt$var1 <- as.character(dt$var1)
# make empty strings NAs
dt[, var1 := na_if(var1, '')]
# useful for when i'm filling in many variables
fill.in.vs <- c('var1')
fixed.vnames <- paste0('fixed.', fill.in.vs)
# ------------ Call the function and results
dt[, (fixed.vnames) := sapply(.SD,
FUN = fill.in,
year,
finalyr,
simplify = FALSE, USE.NAMES = FALSE),
by = id, .SDcols = fill.in.vs]
# this gives me what I want:
dt
id var1 year finalyr fixed.var1
1: a <NA> 2005 2009 <NA>
2: a <NA> 2006 2009 <NA>
3: a <NA> 2007 2009 <NA>
4: a <NA> 2008 2009 <NA>
5: a bonjour 2009 2009 bonjour
6: b bye 1995 2002 bye
7: b <NA> 1996 2002 bye
8: b bye 1997 2002 bye
9: b bye 1998 2002 bye
10: b <NA> 2002 2002 <NA>
11: c hi 1995 1999 hi
12: c hi 1996 1999 hi
13: c <NA> 1997 1999 hi
14: c hi 1998 1999 hi
15: c hi 1999 1999 hi
The problem is that when the first set of IDs--e.g. all the 'a' values--have empty strings that I turn into NA
s, all values of the "fixed" variable end up NA
s as well.
So using that same code but with the following data, I get all NA
s in the new variable:
# id of 'a' now is all empty strings in var1:
dt <- data.frame(id = c(rep('a', 5),
rep('b', 5),
rep('c', 5)),
var1 = c(rep('', 5),
'bye', NA, 'bye', 'bye', NA,
'hi', 'hi', NA, 'hi', 'hi'),
year = c(2005:2009,
1995:1998, 2002,
1995:1999))
# which results in this final data after running the same code above:
dt
id var1 year finalyr fixed.var1
1: a <NA> 2005 2009 NA
2: a <NA> 2006 2009 NA
3: a <NA> 2007 2009 NA
4: a <NA> 2008 2009 NA
5: a <NA> 2009 2009 NA
6: b bye 1995 2002 NA
7: b <NA> 1996 2002 NA
8: b bye 1997 2002 NA
9: b bye 1998 2002 NA
10: b <NA> 2002 2002 NA
11: c hi 1995 1999 NA
12: c hi 1996 1999 NA
13: c <NA> 1997 1999 NA
14: c hi 1998 1999 NA
15: c hi 1999 1999 NA
For brevity I won't show you all the things I've tried, but a few observations about when it happens:
- All empty strings in the first ID isn't a problem if I do not convert empty strings to
NA
. - I only get this result if it's the first ID that has all empty strings; if it's the second set, the results are fine.
- How I convert
""
toNA
doesn't matter, i.e. it's not an issue withna_if
because it also happens when I useifelse
.
Overall I'm pretty stumped as to what's happening or how to investigate it further. I would really appreciate any help.
CodePudding user response:
When I run your code, I get this warning:
1: In
[.data.table(dt, ,
:=((fixed.vnames), sapply(.SD, FUN = fill.in, : Coercing 'character' RHS to 'logical' to match the type of the target column (column 0 named '').
I get it twice, once for the second, and for the third group.
As it says, the variable fixed.var1
is initialised as logical variable (for the group id==a
); values that are added later are then converted to the same class 'logical'.
The major culprit here is your function fill.in()
, since e.g.
logicalVar <- fill.in( var=rep(NA,5), yr=2005:2009, finyr=rep(2009,5)); class(logicalVar)
returns a logical variable.
So all you need to do is to wrap as.character()
around the return of your function.