R data.table does not recognize reference to input column passed as a call to a function

Time:06-29

I am calculating the minimum and maximum age decade for a research cohort using R data.table syntax. Coding the two calculations as separate data.table chains works. When I convert the code to a function and pass the input and output column names to it, data.table recognizes the reference to the output column but not the reference to the input column. I reduced the code to the following example. Suggestions?

### calculate min age and max age decades

library(data.table)
c1 = data.table(
  min_age = c(18, 28, 30), 
  max_age = c(19, 31, 41)
)
head(c1)
c1[min_age < 20, min_age_dec := 1][min_age >= 20 & min_age < 30, min_age_dec := 2][min_age >= 30 & min_age < 40, min_age_dec := 3][min_age >= 40, min_age_dec := 4]
c1[max_age < 20, max_age_dec := 1][max_age >= 20 & max_age < 30, max_age_dec := 2][max_age >= 30 & max_age < 40, max_age_dec := 3][max_age >= 40, max_age_dec := 4]
head(c1)
mmfun <- function(dt, in_c, out_c) {
  dt[in_c < 20, (out_c) := 1][in_c >= 20 & in_c < 30, (out_c) := 2][in_c >= 30 & in_c < 40, (out_c) := 3][in_c >= 40, (out_c) := 4]
}
mmfun(c1, "min_age", "min_age_dec")
mmfun(c1, "max_age", "max_age_dec")
head(c1)
[enter image description here][1]


  [1]: https://i.stack.imgur.com/CobPx.jpg

CodePudding user response:

With the development version of data.table (v1.14.3), you can use the env parameter; see the "Programming on data.table" vignette:

data.table::update.dev.pkg()

mmfun <- function(dt, in_c, out_c) {
  dt[in_c < 20, (out_c) := 1,env=list(in_c=in_c)][
     in_c >= 20 & in_c < 30, (out_c) := 2,env=list(in_c=in_c)][
     in_c >= 30 & in_c < 40, (out_c) := 3,env=list(in_c=in_c)][
     in_c >= 40, (out_c) := 4,env=list(in_c=in_c)]
}
mmfun(c1, "min_age", "min_age_dec")
mmfun(c1, "max_age", "max_age_dec")
head(c1)

   min_age max_age min_age_dec max_age_dec
1:      18      19           1           1
2:      28      31           2           3
3:      30      41           3           4

To simplify the code, you could use fcase:

mmfun <- function(dt, in_c, out_c) {
  dt[, (out_c) := fcase(in_c<20,1,in_c<30,2,in_c<40,3,in_c>=40,4)
     , env = list(in_c=in_c)]
}
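
On a released data.table that lacks env, a common workaround is base R's get(), which looks up the column named by a string. The original function fails because in_c holds the character string "min_age", so in_c < 20 compares that string with 20 rather than comparing the column's values. A minimal sketch, assuming the c1 table from the question; the name mmfun_get is hypothetical and not part of the original answer:

```r
library(data.table)

# Sample data copied from the question
c1 <- data.table(
  min_age = c(18, 28, 30),
  max_age = c(19, 31, 41)
)

# mmfun_get is a hypothetical name; get(in_c) resolves the string held in
# in_c to the column of that name inside the data.table expression
mmfun_get <- function(dt, in_c, out_c) {
  dt[, (out_c) := fcase(
    get(in_c) < 20, 1,
    get(in_c) < 30, 2,
    get(in_c) < 40, 3,
    get(in_c) >= 40, 4
  )]
  invisible(dt)  # dt is modified by reference; return it invisibly
}

mmfun_get(c1, "min_age", "min_age_dec")
mmfun_get(c1, "max_age", "max_age_dec")
print(c1)
```

Where env is available it is probably the cleaner choice; this variant is only meant for setups where it is not.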

CodePudding user response:

Here is one possible way to solve your problem using the base R function findInterval together with the data.table package. Note that I added a default value for the argument out_c (for the case where the output column names are obtained by appending _dec to the input column names).

mmfun <- function(dt, in_c, out_c = paste0(in_c, "_dec")) {
  dt[, (out_c) := lapply(.SD, findInterval, c(0, 20, 30, 40)), .SDcols=in_c]
}

mmfun(c1, c("min_age", "max_age"))

#    min_age max_age min_age_dec max_age_dec
# 1:      18      19           1           1
# 2:      28      31           2           3
# 3:      30      41           3           4

In the thresholds vector c(0, 20, 30, 40), I use 0 because no age can be lower than 0 and, more importantly, so that the interval count starts at 1 rather than 0. If you plan to use this function on variables that can take negative values, replace 0 with -Inf.
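
The effect of the first threshold can be checked on a plain vector, without data.table. This is a small sketch; the variable names and the sample values (including the negative one) are made up for illustration:

```r
# Hypothetical sample values; -5 is included only to show the edge case
x <- c(-5, 18, 28, 30, 41)

# With 0 as the first threshold, values below 0 fall into interval 0
dec_zero <- findInterval(x, c(0, 20, 30, 40))

# With -Inf as the first threshold, every finite value lands in interval 1 or higher
dec_inf <- findInterval(x, c(-Inf, 20, 30, 40))

print(dec_zero)  # 0 1 2 3 4
print(dec_inf)   # 1 1 2 3 4
```

findInterval treats each threshold as the closed left edge of its interval, which is why 30 falls in the third interval rather than the second.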
