Home > Back-end >  Predicting when an output might happen in time in R
Predicting when an output might happen in time in R

Time:09-13

I have a dataset of peoples blood results in R:

        ID Result date    
        A1 80     01/01/2006
        A1 70     01/01/2009
        A1 61     01/01/2010
        A1 30     01/01/2018
        A1 28     01/01/2022
        B2 40     01/01/2006
        B2 30     01/01/2012
        B2 25     01/01/2015
        B2 10     01/01/2020
        G3 28     01/01/2009
        G3 27     01/01/2014
        G3 25     01/01/2013
        G3 24     01/01/2011
        G3 22     01/01/2019

I would like to plot these and then do a linear regression and find the predicted date at which their blood result would hit 15 for each person. In reality there are multiple blood tests per year so I would try and get the median result per year and then plot those and get the predicted year they hit 15.

I can do the median per year and use the lm function along the lines of:

filtered$date_of_bloods <-format(filtered$date_of_bloods,format="%Y")
#split into individual ID groups
a <- with(filtered, split(filtered, list(ID)))

#aggregate median results per year 
medianfunc <- function(y) {aggregate(results ~ date_of_bloods, data = y, median)}
medians <- sapply(a, medianfunc)

# do lm per ID cohort and get slope of lines 
g<- as.data.frame(medians)
coefLM <- function(x) {coef(lm(date_of_bloods ~ results, data = x))}
coefs<- sapply(g, coefLM)

But I am unsure who to extract and get the prediction for when the slope of the line would hit 15.

Any help would be much appreciated

CodePudding user response:

Here is solution relying on dplyr, splitting the data into lists to feed to lapply, where you can define a custom function. For this simple linear regression, we use the coefficients to calculate X for when Y = 15.

You should round the years to your desired level of precision. Note that this does NOT check the assumptions for linear regression, or report on the goodness-of-fit for your given models.

# Use example data from OP
df = structure(list(ID = c("A1", "A1", "A1", "A1", "A1", "B2", 
"B2", "B2", "B2", "G3", "G3", "G3", "G3", "G3"), Result = c(80L, 70L, 
61L, 30L, 28L, 40L, 30L, 25L, 10L, 28L, 27L, 25L, 24L, 22L), 
date = c("01/01/2006", "01/01/2009", "01/01/2010", "01/01/2018", 
"01/01/2022", "01/01/2006", "01/01/2012", "01/01/2015", "01/01/2020", 
"01/01/2009", "01/01/2014", "01/01/2013", "01/01/2011", "01/01/2019"
)), class = "data.frame", row.names = c(NA, -14L))

library(dplyr)
library(lubridate)

df %>% 
  # Create a column holding year for each ID
  mutate(date = dmy(date)) %>% 
  mutate(year = year(date)) %>% 
  # Make yearly concentrations for each id by year, using e.g. median
  group_by(ID, year) %>% 
  summarise(conc = median(Result), .keep_all = TRUE) %>% 
  # Split each individual into separate lists
  split(f = .$ID) %>% 
  lapply(., function(x) {
    
    # Create LM object
    mod = lm(conc ~ year, data = x)
    
    # Use coefficients to find when y = 15
    when_is_15 = (15 - coef(mod)[1]) / coef(mod)[2]
    
    # Wrap up the results
    data.frame(ID = unique(x$ID),
               pred_year = as.numeric(when_is_15))
    
  }) %>% 
  # Combine all the lists into a single data frame
  bind_rows()

#>   ID pred_year
#> 1 A1  2024.246
#> 2 B2  2018.595
#> 3 G3  2035.313

Created on 2022-09-12 with reprex v2.0.2

CodePudding user response:

Here's a base R answer:

filtered <- read.table(text = "ID results date_of_bloods    
                A1 80     01/01/2006
                A1 70     01/01/2009
                A1 61     01/01/2010
                A1 30     01/01/2018
                A1 28     01/01/2022
                B2 40     01/01/2006
                B2 30     01/01/2012
                B2 25     01/01/2015
                B2 10     01/01/2020
                G3 28     01/01/2009
                G3 27     01/01/2014
                G3 25     01/01/2013
                G3 24     01/01/2011
                G3 22     01/01/2019", header = TRUE)
filtered$date_of_bloods <- as.Date(filtered$date_of_bloods, '%m/%d/%Y')

filtered$date_of_bloods <- format(filtered$date_of_bloods,format="%Y") |> 
  as.integer()
#split into individual ID groups
a <- with(filtered, split(filtered, list(ID)))

#aggregate median results per year 
medianfunc <- function(y) {aggregate(results ~ date_of_bloods, data = y, median)}
medians <- sapply(a, medianfunc)

# do lm per ID cohort and get slope of lines 
g <- as.data.frame(medians)
coefLM <- function(x) {coef(lm(results ~ date_of_bloods, data = x))}
coefs<- sapply(g, coefLM)  

coefs <- 
  rbind(coefs, 
        year_for_15 = (15 - coefs["(Intercept)", ]) / coefs["date_of_bloods", ])

Giving:

coefs
                     A1          B2           G3
(Intercept)    6998.650 4263.381995  953.8239437
date_of_bloods   -3.450   -2.104623   -0.4612676
year_for_15    2024.246 2018.595376 2035.3129771

I changed your code for the year to make it an integer, that way the coefLM function can work in the normal dependent variable ~ independent variable order. Then it's just y = mx b implies x0 = (15 - b) / m.

  • Related