I am calculating the minimum and maximum age decade for a research cohort using R data.table syntax. Coding the two calculations as separate data.table chains works. When I convert the code to a function and pass the input and output column names to it, data.table recognizes the reference to the output column, but not to the input column. I have reduced the code to the following example. Suggestions?
### calculate min age and max age decades
library(data.table)
c1 = data.table(
min_age = c(18, 28, 30),
max_age = c(19, 31, 41)
)
head(c1)
c1[min_age < 20, min_age_dec := 1][min_age >= 20 & min_age < 30, min_age_dec := 2][min_age >= 30 & min_age < 40, min_age_dec := 3][min_age >= 40, min_age_dec := 4]
c1[max_age < 20, max_age_dec := 1][max_age >= 20 & max_age < 30, max_age_dec := 2][max_age >= 30 & max_age < 40, max_age_dec := 3][max_age >= 40, max_age_dec := 4]
head(c1)
mmfun <- function(dt, in_c, out_c) {
dt[in_c < 20, (out_c) := 1][in_c >= 20 & in_c < 30, (out_c) := 2][in_c >= 30 & in_c < 40, (out_c) := 3][in_c >= 40, (out_c) := 4]
}
mmfun(c1, "min_age", "min_age_dec")
mmfun(c1, "max_age", "max_age_dec")
head(c1)
[enter image description here][1]
[1]: https://i.stack.imgur.com/CobPx.jpg
CodePudding user response:
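For context, here is a minimal sketch of why the function in the question does not behave as intended. Inside `i`, `in_c` is not a column of `dt`, so data.table finds the function argument instead, which is a character string holding the column name. The comparison is then made on that string rather than on the column's values. The snippet uses base R only, and the variable `cmp` is introduced purely for illustration.

```r
# in_c holds a column *name*, not the column's values
in_c <- "min_age"

# Comparing a string with a number coerces the number to a string,
# so this is the string comparison "min_age" < "20": a single value
# that says nothing about the ages stored in the column.
cmp <- in_c < 20

length(cmp)                       # 1
identical(cmp, "min_age" < "20")  # TRUE
```

By contrast, `(out_c) :=` works because wrapping the left-hand side in parentheses tells data.table to use the value of `out_c` as the column name.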
With the development version of data.table (v1.14.3) you could use the `env` parameter; see the "Programming on data.table" vignette:
data.table::update.dev.pkg()
mmfun <- function(dt, in_c, out_c) {
dt[in_c < 20, (out_c) := 1,env=list(in_c=in_c)][
in_c >= 20 & in_c < 30, (out_c) := 2,env=list(in_c=in_c)][
in_c >= 30 & in_c < 40, (out_c) := 3,env=list(in_c=in_c)][
in_c >= 40, (out_c) := 4,env=list(in_c=in_c)]
}
mmfun(c1, "min_age", "min_age_dec")
mmfun(c1, "max_age", "max_age_dec")
head(c1)
min_age max_age min_age_dec max_age_dec
1: 18 19 1 1
2: 28 31 2 3
3: 30 41 3 4
To simplify the code, you could use `fcase`:
mmfun <- function(dt, in_c, out_c) {
dt[, (out_c) := fcase(in_c<20,1,in_c<30,2,in_c<40,3,in_c>=40,4)
, env = list(in_c=in_c)]
}
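If installing the development version is not an option, one possible alternative (not part of the answer above) is to pull the input column out by name with `[[` and assign the result with `set()`. This is only a sketch: the function name `mmfun_set` is hypothetical, and it assumes an installed data.table version that provides `fcase`.

```r
library(data.table)

# Hypothetical variant of mmfun that avoids the env argument
mmfun_set <- function(dt, in_c, out_c) {
  x <- dt[[in_c]]  # values of the input column, looked up by name
  # set() adds or updates the column named in out_c by reference
  set(dt, j = out_c,
      value = fcase(x < 20, 1, x < 30, 2, x < 40, 3, x >= 40, 4))
  invisible(dt)
}

# Same example data as in the question
c1 <- data.table(min_age = c(18, 28, 30), max_age = c(19, 31, 41))
mmfun_set(c1, "min_age", "min_age_dec")
mmfun_set(c1, "max_age", "max_age_dec")
c1
```

Extracting the column into a local vector first means the conditions never rely on how data.table resolves names inside `[`, at the cost of not using `i` to subset rows.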
CodePudding user response:
Here is one possible way to solve your problem using the built-in function `findInterval` and the data.table package. Note that I added a default value for the argument `out_c` (for the case where the output column names are obtained by appending `_dec` to the input column names).
mmfun <- function(dt, in_c, out_c = paste0(in_c, "_dec")) {
dt[, (out_c) := lapply(.SD, findInterval, c(0, 20, 30, 40)), .SDcols=in_c]
}
mmfun(c1, c("min_age", "max_age"))
# min_age max_age min_age_dec max_age_dec
# 1: 18 19 1 1
# 2: 28 31 2 3
# 3: 30 41 3 4
In the thresholds vector `c(0, 20, 30, 40)`, I use 0 because no age can be lower than 0 and, mainly, so that the interval count starts at 1 (and not at 0). If you plan to use this function on variables that can take negative values, replace 0 with `-Inf`.
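The effect of the first threshold can be checked on a plain vector. In this small sketch, the value -5 and the variable names `breaks_zero`, `breaks_inf` and `x` are made up for illustration.

```r
breaks_zero <- c(0, 20, 30, 40)
breaks_inf  <- c(-Inf, 20, 30, 40)

# -5 is an invented negative value; the others resemble the ages above
x <- c(-5, 18, 28, 41)

findInterval(x, breaks_zero)  # 0 1 2 4  (the negative value falls in interval 0)
findInterval(x, breaks_inf)   # 1 1 2 4  (with -Inf every value is at least 1)
```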