I am trying to calculate the population at risk of a particular type of cancer by year. I have a data.table that records whether each patient had cancer (the cancer column, 1/0) and the date their cancer was detected (cancerDate). My data spans 2015 to 2021.
I have written a function for this:
add_par_column <- function(dt, year) {
dt[, `:=`(PAR = cancer == 0 | (cancer == 1 & cancerDate >= paste0(year, "-01-01")))]
}
then implemented the function like this:
DT <- add_par_column(DT, 2015)
DT <- add_par_column(DT, 2016)
DT <- add_par_column(DT, 2017)
#etc.
The problem is that the PAR variable my function creates gets overwritten with each new year that I run, instead of the PAR for each year being kept separately in the data.table.
I have tried to edit the function by appending the year to the PAR variable name like this:
add_par_column <- function(dt, year) {
dt[, `:=`(
paste0("PAR", year) = cancer == 0 | (cancer == 1 & cancerDate >= paste0(year, "-01-01"))
)]
}
but I keep getting error messages.
If I were to do this without the function, I should have these new PAR variables created in the data.table:
DT <- DT[,
`:=`(
PAR2015 = cancer == 0 |(cancer == 1 & cancerDate >= "2015-01-01"),
PAR2016 = cancer == 0 |(cancer == 1 & cancerDate >= "2016-01-01"),
PAR2017 = cancer == 0 |(cancer == 1 & cancerDate >= "2017-01-01"),
PAR2018 = cancer == 0 |(cancer == 1 & cancerDate >= "2018-01-01"),
PAR2019 = cancer == 0 |(cancer == 1 & cancerDate >= "2019-01-01"),
PAR2020 = cancer == 0 |(cancer == 1 & cancerDate >= "2020-01-01"),
PAR2021 = cancer == 0 |(cancer == 1 & cancerDate >= "2021-01-01")
)]
but I am trying to avoid the repetitions.
CodePudding user response:
If we want a single PAR column that keeps its existing values as well as the updates, combine the new condition with the already-created PAR column using an OR (|):
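add_par_column <- function(dt, year) {
  # Create PAR on first use so earlier results can be carried forward
  if (!"PAR" %in% names(dt)) {
    dt[, PAR := FALSE]
  }
  # Update only rows whose cancerDate falls in the given year; OR with the
  # existing PAR so values set by previous calls are kept
  dt[year(cancerDate) == year,
     PAR := (cancer == 0 |
               (cancer == 1 & cancerDate >= paste0(year, "-01-01"))) | PAR]
  dt
}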
Testing:
> add_par_column(DT, 2015)
> DT
cancer cancerDate PAR
1: 0 2015-01-01 TRUE
2: 0 2015-04-01 TRUE
3: 1 2015-07-01 TRUE
4: 0 2015-10-01 TRUE
5: 1 2016-01-01 FALSE
6: 0 2016-04-01 FALSE
7: 0 2016-07-01 FALSE
8: 1 2016-10-01 FALSE
9: 1 2017-01-01 FALSE
10: 1 2017-04-01 FALSE
11: 1 2017-07-01 FALSE
12: 0 2017-10-01 FALSE
13: 1 2018-01-01 FALSE
14: 0 2018-04-01 FALSE
15: 1 2018-07-01 FALSE
16: 0 2018-10-01 FALSE
17: 1 2019-01-01 FALSE
18: 0 2019-04-01 FALSE
19: 0 2019-07-01 FALSE
20: 1 2019-10-01 FALSE
> add_par_column(DT, 2016)
> DT
cancer cancerDate PAR
1: 0 2015-01-01 TRUE
2: 0 2015-04-01 TRUE
3: 1 2015-07-01 TRUE
4: 0 2015-10-01 TRUE
5: 1 2016-01-01 TRUE
6: 0 2016-04-01 TRUE
7: 0 2016-07-01 TRUE
8: 1 2016-10-01 TRUE
9: 1 2017-01-01 FALSE
10: 1 2017-04-01 FALSE
11: 1 2017-07-01 FALSE
12: 0 2017-10-01 FALSE
13: 1 2018-01-01 FALSE
14: 0 2018-04-01 FALSE
15: 1 2018-07-01 FALSE
16: 0 2018-10-01 FALSE
17: 1 2019-01-01 FALSE
18: 0 2019-04-01 FALSE
19: 0 2019-07-01 FALSE
20: 1 2019-10-01 FALSE
Data:
library(data.table)
set.seed(24)
DT <- data.table(cancer = sample(0:1, size = 20, replace = TRUE),
                 cancerDate = seq(as.Date('2015-01-01'), length.out = 20, by = '3 months'))
CodePudding user response:
You could use the LHS := RHS form instead of the functional form `:=`(LHS = RHS). I can't remember seeing the functional form with a calculated LHS, and the error messages you get suggest this isn't allowed: in that form the LHS is parsed as an argument name, so it cannot be an expression such as paste0("PAR", year).
add_par_column <- function(dt, year) {
dt[, paste0("PAR", year) := cancer == 0 | (cancer == 1 & cancerDate >= paste0(year, "-01-01"))]
}
DT <- add_par_column(DT, 2015)
DT <- add_par_column(DT, 2016)
DT <- add_par_column(DT, 2017)
DT[]
# cancer cancerDate PAR2015 PAR2016 PAR2017
# <int> <Date> <lgcl> <lgcl> <lgcl>
# 1: 0 2015-01-01 TRUE TRUE TRUE
# 2: 0 2015-04-01 TRUE TRUE TRUE
# 3: 1 2015-07-01 TRUE FALSE FALSE
# 4: 0 2015-10-01 TRUE TRUE TRUE
# 5: 1 2016-01-01 TRUE TRUE FALSE
...
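To avoid repeating the call for every year, the same idea can be extended to create all the columns in one step. This is only a sketch, assuming the toy data from the other answer; the names years, par_cols and par_by_year are made up for illustration. It relies on := accepting a character vector of column names on the left and a list of equal length on the right.

```r
library(data.table)

# Same toy data as in the other answer
set.seed(24)
DT <- data.table(
  cancer     = sample(0:1, size = 20, replace = TRUE),
  cancerDate = seq(as.Date("2015-01-01"), length.out = 20, by = "3 months")
)

years    <- 2015:2021               # study period from the question
par_cols <- paste0("PAR", years)    # "PAR2015", ..., "PAR2021"

# A character vector on the LHS of := names several new columns at once;
# the RHS is a list holding one logical vector per year.
DT[, (par_cols) := lapply(years, function(y) {
  cancer == 0 | (cancer == 1 & cancerDate >= as.Date(paste0(y, "-01-01")))
})]

# Population at risk per year = number of TRUE values in each PAR column
par_by_year <- DT[, lapply(.SD, sum), .SDcols = par_cols]
par_by_year
```

Because := modifies DT by reference, no reassignment (DT <- ...) is needed. If you prefer to keep add_par_column as written above, a plain loop such as for (y in 2015:2021) add_par_column(DT, y) should give the same columns.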