I have a dataset with longitudinal data in long format. For each id, I have the number of years since their baseline assessment. For some of the assessments, I have the year it occurred but not all assessments have this information. I am trying to fill in missing years based on the time elapsed between assessments.
id years_since_baseline study_year
1 1 3.09 NA
2 1 3.87 NA
3 1 4.91 2021
4 1 6.36 NA
5 2 0 2018
6 2 1.02 NA
7 2 2.40 NA
8 3 13.9 NA
9 3 15.0 NA
10 3 15.71 2017
Ideally I could calculate the missing study year by using the years_since_baseline (e.g. for row 6, study_year = 2018 1.02 = 2019.02) producing something that looks like:
id years_since_baseline study_year
1 1 3.09 2019.08
2 1 3.87 2019.96
3 1 4.91 2021
4 1 6.36 2022.45
5 2 0 2018
6 2 1.02 2019.02
7 2 2.40 2020.40
8 3 13.9 2015.19
9 3 15.0 2016.29
10 3 15.71 2017
However, the assessment with year information varies between ids - some have it as the first assessment, others in the middle and some at the last assessment. Also, I will have to calculate the time elapsed between each assessment.
I have tried using combinations of group_by, fill and lag to calculate the missing years but with no success.
CodePudding user response:
You can use mutate()
along with group_by()
to calculate an estimate of the baseline year for all rows in each group. (I include mean()
just in case there is more than one row per id with the study year present.)
Then use this column with the years_since_baseline
to calculate the missing study_year
values.
have <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
years_since_baseline = c(3.09, 3.87, 4.91, 6.36, 0, 1.02, 2.4, 13.9, 15, 15.71),
study_year = c(NA, NA, 2021, NA, 2018, NA, NA, NA, NA, 2017))
want <- have %>%
group_by(id) %>%
mutate(
# calculate estimated baseline year
est_baseyear = mean(ifelse(!is.na(study_year), study_year - years_since_baseline, NA), na.rm=TRUE),
# use this to calculate a new study year for rows where it is missing
study_year = ifelse(is.na(study_year), est_baseyear years_since_baseline, study_year)
) %>%
# drop the estimate columm
select(-est_baseyear)
I think this gives your required output. However, I would have a concern about the fact that the original study_year
variable is a whole number so these calculations might be biasing the results toward an earlier year. Maybe use round(est_baseyear years_since_baseline)
in the calculation of study_year
to mitigate somewhat against this.