Lets say I have the following data:
library(data.table)
x1 <- structure(list(age = c(NA, 40L, 78L, 49L, 59L),
age_cat = c(NA, "(39,50]", "(72,83]", "(39,50]", "(57,64]")),
row.names = c(NA, -5L), class = c("data.table", "data.frame"))
x2 <- structure(list(age = c(20.5, 25.5, 30.5, 35.5, 40.5, 45.5, 50.5,
55.5, 60.5, 65.5, 70.5, 75.5, 80.5, 85.5, 90.5),
yhat = c(2.04759618619815, 1.48463998876167,
1.20171574722936, 1.05038762605566,
1.00519209126108, 0.996948029536619,
0.945501768666378, 0.834241582084186,
0.731586245690655, 0.727522200367293,
0.816133858934611, 0.817941581611403,
0.498731829342745, 0.104433989674483,
1.59446835516229)),
row.names = c(NA, -15L),
class = "data.frame")
Then a function is created to join on the nearest age:
join_nearest <- function(dat, reference, x){
reference <- data.table::data.table(reference)
dat <- data.table::data.table(dat)
join <- reference[dat, roll = "nearest", on = x]
as.data.frame(join)
}
This gives the following solution:
join_nearest(x1, x2, "age")
#> age yhat age_cat
#> 1 NA 2.0475962 <NA>
#> 2 40 1.0051921 (39,50]
#> 3 78 0.8179416 (72,83]
#> 4 49 0.9455018 (39,50]
#> 5 59 0.7315862 (57,64]
In case age
equals NA I was expecting that yhat
would be equal to NA as well. I am wondering why this is not the case, and is there a way to return NA for rows where age
is equal to NA?
Created on 2021-11-10 by the reprex package (v2.0.1)
CodePudding user response:
You can do a manual range-based join.
Up front: x1$age
is integer but x2$age
is numeric, so I'll convert the first to numeric
so that data.table doesn't complain about class.
From there, I'll create lo/hi fields in x2
that we can use to join. If you prefer the break points to be other than halfway, that's easy enough to change, over to your needs.
x1[, age := as.numeric(age)]
x1[x2[, c("lo","hi") := .(age - c(age[1],diff(age)/2), age c(diff(age)/2,Inf))
], yhat := i.yhat, on = .(age >= lo, age < hi),
][]
# age age_cat yhat
# <num> <char> <num>
# 1: NA <NA> NA
# 2: 40 (39,50] 1.0051921
# 3: 78 (72,83] 0.4987318
# 4: 49 (39,50] 0.9455018
# 5: 59 (57,64] 0.7315862
## cleanup
x2[, c("lo","hi") := NULL]
I recognize a momentary side-effect of adding lo
and hi
, ergo the cleanup. If your data is not large, you can sidestep this by using copy(x2)[,c("lo","hi"):=...]
instead, at the expense of memory-inefficiency.