Home > Software design >  how to find the min difference from a value in one col to the value from multiple cols
how to find the min difference from a value in one col to the value from multiple cols

Time:05-25

I have a df that looks like following. I would like to find the closest date (from R_1:R_10) that happened before date. Is it a way to calculate the difference in days for R1:R10 to date at once and then pick out the the value we want, and then put it in a new variable cdate?

enter image description here

the output will looks like this:

enter image description here

the cdate will be identified by R_1:R10-date, whichever has the greatest negative value will be the date that we put in cdate

Sample data:

df<-structure(list(id = c("A0003", "A0003", "A0003", "A0003", "A0003", 
"A0003", "A0005", "A0005"), date = structure(c(17774, 17799, 
17805, 17837, 17846, 17873, 17784, 17784), class = "Date"), R_1 = structure(c(17773, 
17773, 17773, 17773, 17773, 17773, 17785, 17785), class = "Date"), 
    R_2 = structure(c(17815, 17815, 17815, 17815, 17815, 17815, 
    17827, 17827), class = "Date"), R_5 = structure(c(17794, 
    17794, 17794, 17794, 17794, 17794, 17806, 17806), class = "Date"), 
    R_10 = structure(c(17871, 17871, 17871, 17871, 17871, 17871, 
    NA, NA), class = "Date")), row.names = c(NA, 8L), class = "data.frame")

CodePudding user response:

With base R, we loop over the 'R_' columns with lapply, replace the values that are greater than 'date' to NA, then use pmax to return the max date value

df$cdate <-  do.call(pmax, c(lapply(df[grep("R_", names(df))], 
     \(x) replace(x, x > df$date, NA)), na.rm = TRUE))

-output

> df
     id       date        R_1        R_2        R_5       R_10      cdate
1 A0003 2018-08-31 2018-08-30 2018-10-11 2018-09-20 2018-12-06 2018-08-30
2 A0003 2018-09-25 2018-08-30 2018-10-11 2018-09-20 2018-12-06 2018-09-20
3 A0003 2018-10-01 2018-08-30 2018-10-11 2018-09-20 2018-12-06 2018-09-20
4 A0003 2018-11-02 2018-08-30 2018-10-11 2018-09-20 2018-12-06 2018-10-11
5 A0003 2018-11-11 2018-08-30 2018-10-11 2018-09-20 2018-12-06 2018-10-11
6 A0003 2018-12-08 2018-08-30 2018-10-11 2018-09-20 2018-12-06 2018-12-06
7 A0005 2018-09-10 2018-09-11 2018-10-23 2018-10-02       <NA>       <NA>
8 A0005 2018-09-10 2018-09-11 2018-10-23 2018-10-02       <NA>       <NA>

CodePudding user response:

We can join the data to the summarized version, found by pivoting the data longer, grouping by id and date, filtering out R_x dates that are later, and summarizing for the last one.

library(dplyr)
df %>%
  left_join(
    df %>%
      tidyr::pivot_longer(-c(id,date)) %>%
      group_by(id, date) %>%
      filter(value <= date) %>%
      summarize(cdate = max(value), .groups = "drop")
  )

Result

Joining, by = c("id", "date")
     id       date        R_1        R_2        R_5       R_10      cdate
1 A0003 2018-08-31 2018-08-30 2018-10-11 2018-09-20 2018-12-06 2018-08-30
2 A0003 2018-09-25 2018-08-30 2018-10-11 2018-09-20 2018-12-06 2018-09-20
3 A0003 2018-10-01 2018-08-30 2018-10-11 2018-09-20 2018-12-06 2018-09-20
4 A0003 2018-11-02 2018-08-30 2018-10-11 2018-09-20 2018-12-06 2018-10-11
5 A0003 2018-11-11 2018-08-30 2018-10-11 2018-09-20 2018-12-06 2018-10-11
6 A0003 2018-12-08 2018-08-30 2018-10-11 2018-09-20 2018-12-06 2018-12-06
7 A0005 2018-09-10 2018-09-11 2018-10-23 2018-10-02       <NA>       <NA>
8 A0005 2018-09-10 2018-09-11 2018-10-23 2018-10-02       <NA>       <NA>
  • Related