I have a dataset of peoples blood results in R:
ID Result date
A1 80 01/01/2006
A1 70 01/01/2009
A1 61 01/01/2010
A1 30 01/01/2018
A1 28 01/01/2022
B2 40 01/01/2006
B2 30 01/01/2012
B2 25 01/01/2015
B2 10 01/01/2020
G3 28 01/01/2009
G3 27 01/01/2014
G3 25 01/01/2013
G3 24 01/01/2011
G3 22 01/01/2019
I would like to plot these and then do a linear regression and find the predicted date at which their blood result would hit 15 for each person. In reality there are multiple blood tests per year so I would try and get the median result per year and then plot those and get the predicted year they hit 15.
I can do the median per year and use the lm function along the lines of:
filtered$date_of_bloods <-format(filtered$date_of_bloods,format="%Y")
#split into individual ID groups
a <- with(filtered, split(filtered, list(ID)))
#aggregate median results per year
medianfunc <- function(y) {aggregate(results ~ date_of_bloods, data = y, median)}
medians <- sapply(a, medianfunc)
# do lm per ID cohort and get slope of lines
g<- as.data.frame(medians)
coefLM <- function(x) {coef(lm(date_of_bloods ~ results, data = x))}
coefs<- sapply(g, coefLM)
But I am unsure who to extract and get the prediction for when the slope of the line would hit 15.
Any help would be much appreciated
CodePudding user response:
Here is solution relying on dplyr
, splitting the data into lists to feed to lapply
, where you can define a custom function.
For this simple linear regression, we use the coefficients to calculate X for when Y = 15.
You should round the years to your desired level of precision. Note that this does NOT check the assumptions for linear regression, or report on the goodness-of-fit for your given models.
# Use example data from OP
df = structure(list(ID = c("A1", "A1", "A1", "A1", "A1", "B2",
"B2", "B2", "B2", "G3", "G3", "G3", "G3", "G3"), Result = c(80L, 70L,
61L, 30L, 28L, 40L, 30L, 25L, 10L, 28L, 27L, 25L, 24L, 22L),
date = c("01/01/2006", "01/01/2009", "01/01/2010", "01/01/2018",
"01/01/2022", "01/01/2006", "01/01/2012", "01/01/2015", "01/01/2020",
"01/01/2009", "01/01/2014", "01/01/2013", "01/01/2011", "01/01/2019"
)), class = "data.frame", row.names = c(NA, -14L))
library(dplyr)
library(lubridate)
df %>%
# Create a column holding year for each ID
mutate(date = dmy(date)) %>%
mutate(year = year(date)) %>%
# Make yearly concentrations for each id by year, using e.g. median
group_by(ID, year) %>%
summarise(conc = median(Result), .keep_all = TRUE) %>%
# Split each individual into separate lists
split(f = .$ID) %>%
lapply(., function(x) {
# Create LM object
mod = lm(conc ~ year, data = x)
# Use coefficients to find when y = 15
when_is_15 = (15 - coef(mod)[1]) / coef(mod)[2]
# Wrap up the results
data.frame(ID = unique(x$ID),
pred_year = as.numeric(when_is_15))
}) %>%
# Combine all the lists into a single data frame
bind_rows()
#> ID pred_year
#> 1 A1 2024.246
#> 2 B2 2018.595
#> 3 G3 2035.313
Created on 2022-09-12 with reprex v2.0.2
CodePudding user response:
Here's a base R answer:
filtered <- read.table(text = "ID results date_of_bloods
A1 80 01/01/2006
A1 70 01/01/2009
A1 61 01/01/2010
A1 30 01/01/2018
A1 28 01/01/2022
B2 40 01/01/2006
B2 30 01/01/2012
B2 25 01/01/2015
B2 10 01/01/2020
G3 28 01/01/2009
G3 27 01/01/2014
G3 25 01/01/2013
G3 24 01/01/2011
G3 22 01/01/2019", header = TRUE)
filtered$date_of_bloods <- as.Date(filtered$date_of_bloods, '%m/%d/%Y')
filtered$date_of_bloods <- format(filtered$date_of_bloods,format="%Y") |>
as.integer()
#split into individual ID groups
a <- with(filtered, split(filtered, list(ID)))
#aggregate median results per year
medianfunc <- function(y) {aggregate(results ~ date_of_bloods, data = y, median)}
medians <- sapply(a, medianfunc)
# do lm per ID cohort and get slope of lines
g <- as.data.frame(medians)
coefLM <- function(x) {coef(lm(results ~ date_of_bloods, data = x))}
coefs<- sapply(g, coefLM)
coefs <-
rbind(coefs,
year_for_15 = (15 - coefs["(Intercept)", ]) / coefs["date_of_bloods", ])
Giving:
coefs
A1 B2 G3
(Intercept) 6998.650 4263.381995 953.8239437
date_of_bloods -3.450 -2.104623 -0.4612676
year_for_15 2024.246 2018.595376 2035.3129771
I changed your code for the year to make it an integer, that way the coefLM
function can work in the normal dependent variable ~ independent variable order.
Then it's just y = mx b implies x0 = (15 - b) / m.