Home > Enterprise >  Roll join does not give NA values in data.table
Roll join does not give NA values in data.table

Time:11-11

Lets say I have the following data:

library(data.table)           
x1 <- structure(list(age = c(NA, 40L, 78L, 49L, 59L), 
                     age_cat = c(NA, "(39,50]", "(72,83]", "(39,50]", "(57,64]")),
                row.names = c(NA, -5L), class = c("data.table", "data.frame"))  

x2 <- structure(list(age = c(20.5, 25.5, 30.5, 35.5, 40.5, 45.5, 50.5, 
                             55.5, 60.5, 65.5, 70.5, 75.5, 80.5, 85.5, 90.5), 
                     yhat = c(2.04759618619815, 1.48463998876167, 
                              1.20171574722936, 1.05038762605566, 
                              1.00519209126108, 0.996948029536619, 
                              0.945501768666378, 0.834241582084186, 
                              0.731586245690655, 0.727522200367293, 
                              0.816133858934611, 0.817941581611403, 
                              0.498731829342745, 0.104433989674483, 
                              1.59446835516229)), 
                row.names = c(NA, -15L), 
                class = "data.frame")

Then a function is created to join on the nearest age:

join_nearest <- function(dat, reference, x){
  reference <- data.table::data.table(reference)
  dat <- data.table::data.table(dat)
  join <- reference[dat, roll = "nearest", on = x]
  as.data.frame(join)
}

This gives the following solution:

join_nearest(x1, x2, "age")
#>   age      yhat age_cat
#> 1  NA 2.0475962    <NA>
#> 2  40 1.0051921 (39,50]
#> 3  78 0.8179416 (72,83]
#> 4  49 0.9455018 (39,50]
#> 5  59 0.7315862 (57,64]

In case age equals NA I was expecting that yhat would be equal to NA as well. I am wondering why this is not the case, and is there a way to return NA for rows where age is equal to NA?

Created on 2021-11-10 by the reprex package (v2.0.1)

CodePudding user response:

You can do a manual range-based join.

Up front: x1$age is integer but x2$age is numeric, so I'll convert the first to numeric so that data.table doesn't complain about class.

From there, I'll create lo/hi fields in x2 that we can use to join. If you prefer the break points to be other than halfway, that's easy enough to change, over to your needs.

x1[, age := as.numeric(age)]

x1[x2[, c("lo","hi") := .(age - c(age[1],diff(age)/2), age   c(diff(age)/2,Inf))
   ], yhat := i.yhat, on = .(age >= lo, age < hi),
   ][]
#      age age_cat      yhat
#    <num>  <char>     <num>
# 1:    NA    <NA>        NA
# 2:    40 (39,50] 1.0051921
# 3:    78 (72,83] 0.4987318
# 4:    49 (39,50] 0.9455018
# 5:    59 (57,64] 0.7315862

## cleanup
x2[, c("lo","hi") := NULL]

I recognize a momentary side-effect of adding lo and hi, ergo the cleanup. If your data is not large, you can sidestep this by using copy(x2)[,c("lo","hi"):=...] instead, at the expense of memory-inefficiency.

  • Related