I am trying to calculate person-years for my dataset (that contains fake patients data), stratifyng them by age group. My age groups have 5 years incremental ("0-4", "5-9", "10-14", "15-19", ..., "85 " --> for a total of 18 age groups).
Therefore the objective of the code below is to calculate how many person-years every patient has contributed to each age group.
# initialize a matrix with 35 column - one for
# each entry in my dataset - and 18 rows
m <- matrix(data = rep(0,18*35), nrow = 18, ncol = 35)
for (i in 1:nrow(df)){ #loops through my dataset (df) where every row is a different patient
start_date <- as.Date(df$DSTART[i]) # create a start/end/birth date for the every patient
end_date <- as.Date(df$DLAST[i])
birth_date <- as.Date(df$DOB[i])
# create a vector of dates between start and end of follow-up for every patient
seq_date <- seq.Date(from = start_date, to = end_date, by = "days")
# the following print commands allow me to control that the loop is correctly
# identifying the dates defined above
print(paste("The starting age of patients n°", i, "is", round(((start_date-birth_date)/365.25), 0), "and his final age is", round(((seq_date[length(seq_date)]-birth_date)/365.25), 0)))
print(paste("His total FU is", round(((length(seq_date))/365.25), 2), "years, from", seq_date[1], "to", seq_date[length(seq_date)]))
# calculate to which age group every single day of follow-up
# belongs to, iterating through the dates in seq_date, calculating
# how old was the patient during every single day, and then calculating
# the correct age group dividing (age/5) 1 and truncating that to the lower
# integer
for (d in seq_date){
# j is the age group of every day in the seq_date vector
j <- 0
j <- min(18, floor((((seq_date-birth_date)/365.25)/5) 1))
# use the age group for every day in seq_date
# to update the value of the specific element of the matrix m
# where j should be the age group and i the patient
m[j, i] <- (m[j, i] 1) # count patient-days in each age group adding 1
}
# print the column that is associated with the ith patient
# It divides by 365.25 to go from days to year with 2 decimals
print(round((m[, i])/365.25, 2))
cat(" \n")
}
print(round((m/365.25), 0))
In the output 1 picture it shows that the code is correclty iterating through patients, that is correctly creating the seq.Date object (as shown with the print command), but the person-years of each patient (even if correctly measured) are all summed up in the first age group measured for each patient, as if in the for loop j is not changing values.
output 2 I tried to remove the min function from the loop, and the for loop sums all the valued in all the age group for every patient. The error message is a consequence of oen patinent being over 90y during his follow-up and therefore it would be age group 19 (which doesn't exist).
The ideal output should have the person years for each patients correctly summed up in the respective age group. For example, for patient 1 it should divide the 7.47 person years in cathegory 14 ("65-69"), 15 ("70-74"), 16 ("75-79")
CodePudding user response:
@jblood94 is right about fixing your code. If I could make a suggestion, making a dplyr
pipeline is quite a lot faster. The loop on my 35 hypothetical patients took about 10.5 seconds and the dplyr pipeline takes less than 1 second:
library(tidyr)
library(lubridate)
#> Loading required package: timechange
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(519)
df <- tibble(
DOB = ymd(paste0(round(seq(1930, 1980, length=35)), "/1/1")),
DSTART = DOB runif(35, 1, 7500),
DLAST = DSTART runif(35, 100, 7500)
)
out <- df %>%
mutate(ID = 1:n()) %>%
rowwise() %>%
mutate(s = list(seq.Date(DSTART, DLAST, by="days"))) %>%
unnest(s) %>%
mutate(age = s-DOB,
group = as.numeric(floor((age/365.25)/5) 1),
group = ifelse(group > 18, 18, group)) %>%
group_by(ID, DOB, DSTART, DLAST, group) %>%
summarise(n = round(n()/365.25, 0)) %>%
ungroup %>%
arrange(group, ID) %>%
pivot_wider(names_from = "group", names_prefix = "group_", values_from="n", values_fill=0)
#> `summarise()` has grouped output by 'ID', 'DOB', 'DSTART', 'DLAST'. You can
#> override using the `.groups` argument.
out
#> # A tibble: 35 × 12
#> ID DOB DSTART DLAST group_1 group_2 group_3 group_4
#> <int> <date> <date> <date> <dbl> <dbl> <dbl> <dbl>
#> 1 7 1939-01-01 1939-08-16 1941-01-05 1 0 0 0
#> 2 11 1945-01-01 1948-10-20 1968-04-05 1 5 5 5
#> 3 12 1946-01-01 1949-02-09 1965-06-01 2 5 5 4
#> 4 17 1954-01-01 1954-12-18 1955-11-10 1 0 0 0
#> 5 24 1964-01-01 1968-11-08 1981-06-08 0 5 5 2
#> 6 25 1965-01-01 1966-05-05 1983-11-21 4 5 5 4
#> 7 27 1968-01-01 1968-05-25 1971-08-26 3 0 0 0
#> 8 33 1977-01-01 1981-07-08 1986-10-03 0 5 0 0
#> 9 1 1930-01-01 1937-08-24 1955-08-31 0 2 5 5
#> 10 8 1940-01-01 1948-02-25 1965-05-28 0 2 5 5
#> # … with 25 more rows, and 4 more variables: group_5 <dbl>, group_6 <dbl>,
#> # group_7 <dbl>, group_8 <dbl>
Created on 2023-01-18 by the reprex package (v2.0.1)