I'm going to give a lot of background (on which I welcome suggestions to improve) that isn't strictly necessary, so you can scroll to the end if you just want to see my main question.
Background
I have a dataset where I'd like to fill in missing values for several variables according to a few rules. Some example data:
library(data.table)
library(tidyverse)
dt <- data.frame(id = c('a', 'a', 'a', 'a', 'a',
'b', 'b', 'b', 'b', 'b'),
var1 = c(1, 1, NA, 1, NA,
NA, 5, 5, NA, 5),
var2 = c('hi', 'hi', NA, 'hi', 'hi',
'bye', NA, 'bye', 'bye', NA),
year = c(2005:2009,
1995:1998, 2002))
Note that the real data is big: 10s of millions of observations, at least 30-40 variables, and I'll be filling in around 10 columns or so, hence the data.table
approach.
I currently fill them in using a function that finds leads and lags and checks my conditions, like so:
fill.in <- function(var, yr, finyr) {
leadv <- lead(var, order_by = yr)
lagv <- lag(var, order_by = yr)
leadyr <- lead(yr, order_by = yr)
lagyr <- lag(yr, order_by = yr)
# ------- build the updated var w/ sequential conditions
# keep the var as it is if not missing
try1 <- ifelse(test = !is.na(var),
yes = var,
no = NA)
# fill in if the lead and lag match and no more than 2 missing years
try2 <- ifelse(test = is.na(try1) & leadv == lagv &
abs(leadyr-lagyr) <= 3 &
!is.na(leadv),
yes = leadv,
no = try1)
# fill in with the lag if it's the final year of observed data
ifelse(test = is.na(try2) & yr == finyr &
abs(yr-lagyr) <= 3 & !is.na(lagv),
yes = lagv,
no = try2)
}
I call this using data.table
's .SD
:
setDT(dt)
dt[, finalyr := max(year), by = id]
# variables to iterate over and the new column names
fill.in.vs <- c('var1', 'var2')
fixed.vnames <- paste0('fixed.', fill.in.vs)
# I don't want factor values, so make strings characters
dt$var2 <- as.character(dt$var2)
dt[, (fixed.vnames) := sapply(.SD,
FUN = fill.in,
yr = year,
finyr = finalyr,
simplify = FALSE, USE.NAMES = FALSE),
by = id, .SDcols = fill.in.vs]
This gives me the data I want:
id var1 var2 year finalyr fixed.var1 fixed.var2
1: a 1 hi 2005 2009 1 hi
2: a 1 hi 2006 2009 1 hi
3: a NA <NA> 2007 2009 1 hi
4: a 1 hi 2008 2009 1 hi
5: a NA hi 2009 2009 1 hi
6: b NA bye 1995 2002 NA bye
7: b 5 <NA> 1996 2002 5 bye
8: b 5 bye 1997 2002 5 bye
9: b NA bye 1998 2002 NA bye
10: b 5 <NA> 2002 2002 5 <NA>
Main question
Ok now to my actual question: I want an indicator of rows that have been updated. I don't think the way I'm updating, with data.table
and .SD
, allows for making paired variables (by which I mean producing two or three different variables In each iteration), so I thought I would just flag the observations where essentially original.var != fixed.var
.
This runs into issues because R doesn't like comparing values to NA
s. (See here for example.)
I've tried to implement some solutions for this but I keep running into issues with trying to do this over a list of variables. For example, the following works to identify which are the same or not:
# this works:
dt$updated.var1 <- mapply(identical, dt$var1, dt$fixed.var1,
SIMPLIFY = FALSE, USE.NAMES = FALSE)
dt[,c('var1', 'fixed.var1', 'updated.var1')]
var1 fixed.var1 updated.var1
1: 1 1 TRUE
2: 1 1 TRUE
3: NA 1 FALSE
4: 1 1 TRUE
5: NA 1 FALSE
6: NA NA TRUE
7: 5 5 TRUE
8: 5 5 TRUE
9: NA NA TRUE
10: 5 5 TRUE
But when I try to use it with data.table
and lists of variables it returns all FALSE
for both variables:
# indicate updated observations
update.names <- paste0('updated.', fill.in.vs)
# this doesn't work how I want it to:
dt[, (update.names) := mapply(identical,
.SD,
fixed.vnames,
SIMPLIFY = FALSE, USE.NAMES = FALSE),
.SDcols = fill.in.vs]
dt[,c('var1', 'fixed.var1', 'updated.var1')]
var1 fixed.var1 updated.var1
1: 1 1 FALSE
2: 1 1 FALSE
3: NA 1 FALSE
4: 1 1 FALSE
5: NA 1 FALSE
6: NA NA FALSE
7: 5 5 FALSE
8: 5 5 FALSE
9: NA NA FALSE
10: 5 5 FALSE
I'm not certain what the issue is, but I'm guessing it has to do with my attempt to get mapply
to recognize the list of the "fixed" variables as variables to iterate over.
Any suggestions would be greatly appreciated.
CodePudding user response:
You can write a custom function to handle this and use Map
to apply the function for each pair.
library(data.table)
new_cols <- function(x, y) {
z <- x != y | is.na(x) & is.na(y)
z[is.na(z)] <- TRUE
z
}
update.names <- paste0('updated.', fill.in.vs)
dt[, (update.names) := Map(new_cols, dt[, ..fixed.vnames], dt[, ..fill.in.vs])]
dt
# id var1 var2 year finalyr fixed.var1 fixed.var2 updated.var1 updated.var2
# 1: a 1 hi 2005 2009 1 hi FALSE FALSE
# 2: a 1 hi 2006 2009 1 hi FALSE FALSE
# 3: a NA <NA> 2007 2009 1 hi TRUE TRUE
# 4: a 1 hi 2008 2009 1 hi FALSE FALSE
# 5: a NA hi 2009 2009 1 hi TRUE FALSE
# 6: b NA bye 1995 2002 NA bye TRUE FALSE
# 7: b 5 <NA> 1996 2002 5 bye FALSE TRUE
# 8: b 5 bye 1997 2002 5 bye FALSE FALSE
# 9: b NA bye 1998 2002 NA bye TRUE FALSE
#10: b 5 <NA> 2002 2002 5 <NA> FALSE TRUE
You might get warnings in the console, please ignore them for now. There is an open issue regarding it. https://github.com/Rdatatable/data.table/issues/2988