Is there a way to calculate missing values by combining other known values?-CodePudding

I have a dataset with longitudinal data in long format. For each id, I have the number of years since their baseline assessment. For some of the assessments, I have the year it occurred but not all assessments have this information. I am trying to fill in missing years based on the time elapsed between assessments.

   id    years_since_baseline      study_year 
1  1                  3.09              NA
2  1                  3.87              NA
3  1                  4.91              2021
4  1                  6.36              NA
5  2                  0                 2018
6  2                  1.02              NA
7  2                  2.40              NA
8  3                  13.9              NA
9  3                  15.0              NA
10 3                 15.71              2017

Ideally I could calculate the missing study year by using the years_since_baseline (e.g. for row 6, study_year = 2018 1.02 = 2019.02) producing something that looks like:

   id    years_since_baseline      study_year 
1  1                  3.09              2019.08
2  1                  3.87              2019.96
3  1                  4.91              2021
4  1                  6.36              2022.45
5  2                  0                 2018
6  2                  1.02              2019.02
7  2                  2.40              2020.40
8  3                  13.9              2015.19
9  3                  15.0              2016.29
10 3                 15.71              2017

However, the assessment with year information varies between ids - some have it as the first assessment, others in the middle and some at the last assessment. Also, I will have to calculate the time elapsed between each assessment.

I have tried using combinations of group_by, fill and lag to calculate the missing years but with no success.

CodePudding user response：

You can use mutate() along with group_by() to calculate an estimate of the baseline year for all rows in each group. (I include mean() just in case there is more than one row per id with the study year present.)

Then use this column with the years_since_baseline to calculate the missing study_year values.

have <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
                   years_since_baseline = c(3.09,  3.87, 4.91, 6.36, 0, 1.02, 2.4, 13.9, 15, 15.71),
                   study_year = c(NA,  NA, 2021, NA, 2018, NA, NA, NA, NA, 2017))

want <- have %>% 
    group_by(id) %>% 
    mutate(
     # calculate estimated baseline year
        est_baseyear = mean(ifelse(!is.na(study_year), study_year - years_since_baseline, NA), na.rm=TRUE),
     # use this to calculate a new study year for rows where it is missing
        study_year = ifelse(is.na(study_year), est_baseyear   years_since_baseline, study_year)
           ) %>% 
     # drop the estimate columm
    select(-est_baseyear)

I think this gives your required output. However, I would have a concern about the fact that the original study_year variable is a whole number so these calculations might be biasing the results toward an earlier year. Maybe use round(est_baseyear years_since_baseline) in the calculation of study_year to mitigate somewhat against this.