Dplyr - choosing value in column based on lowest value in other column in R-CodePudding

I am currently working on a dataset with multiple biopsies per patient ID. I need to find the biopsy result closest to a specific date (individual per patient). A dummy dataset can be seen below


df <- data.frame(m1 = c("1","1","1","2","2","2"), 
                 patodate=c("2013-06-03","2014-01-06","2018-11-23","2004-03-03","2018-06-25","2018-12-19"), 
                 baselinedate=c("2018-11-09","2018-11-09","2018-11-09","2018-07-24","2018-07-24","2018-07-24"),
                 biopsy=c("1","2","3","1","2","3"))

I have then calculated the time difference between patodate and baselinedate

df$patodate <- as.Date(df$patodate)
df$baselinedate <- as.Date(df$baselinedate)

df <- df%>%
  group_by(m1) %>%
  mutate(diff = baselinedate-recdate)

My question is now - I want to add a new column called 'status' which shows (by group m1) the 'biopsy' result with the time difference closest to 0. The end result would be

df <- data.frame(m1 = c("1","1","1","2","2","2"), 
                 patodate=c("2013-06-03","2014-01-06","2018-11-23","2004-03-03","2018-06-25","2018-12-19"), 
                 baselinedate=c("2018-11-09","2018-11-09","2018-11-09","2018-07-24","2018-07-24","2018-07-24"),
                 biopsy=c("1","2","3","1","2","3"),
                 status=c("3","3","3","2","2","2"))

I hope someone understands the issue and is able to help. Many thanks

Kind regards,

Tobias Berg

CodePudding user response：

You can get index of minimum absolute value of difference between the dates for each group.

library(dplyr)

df %>%
  group_by(m1) %>%
  mutate(status = which.min(abs(patodate - baselinedate))) %>%
  ungroup

#  m1    patodate   baselinedate biopsy status
#  <chr> <date>     <date>       <chr>   <int>
#1 1     2013-06-03 2018-11-09   1           3
#2 1     2014-01-06 2018-11-09   2           3
#3 1     2018-11-23 2018-11-09   3           3
#4 2     2004-03-03 2018-07-24   1           2
#5 2     2018-06-25 2018-07-24   2           2
#6 2     2018-12-19 2018-07-24   3           2

CodePudding user response：

Here is an alternative way, with a hepler column:

library(dplyr)
library(lubridate)
df %>% 
  group_by(m1) %>% 
  mutate(across(contains("date"), ymd),
         helper = abs(difftime(baselinedate,patodate))) %>% 
  mutate(status = biopsy[helper==min(helper)], .keep="unused")

m1    patodate   baselinedate status
  <chr> <date>     <date>       <chr> 
1 1     2013-06-03 2018-11-09   3     
2 1     2014-01-06 2018-11-09   3     
3 1     2018-11-23 2018-11-09   3     
4 2     2004-03-03 2018-07-24   2     
5 2     2018-06-25 2018-07-24   2     
6 2     2018-12-19 2018-07-24   2