Home > database >  How to create an indicator variable for the lowest and highest date with skipping rule in R?
How to create an indicator variable for the lowest and highest date with skipping rule in R?

Time:04-26

I want an indicator variable that tells me if the date is the lowest or highest when its group by ID. However, I don't want it to count anything that has practice as 1. This is what the data frame looks now and what I want it to look like.

ID date practice
1 02-27-2020 1
1 04-21-2021 0
1 06-24-2022 0
2 03-21-2019 0
2 09-19-2020 0
2 01-21-2021 0

End product:

ID date practice lowest highest
1 02-27-2020 1 0 0
1 04-21-2021 0 1 0
1 06-24-2022 0 0 1
2 03-21-2019 0 1 0
2 09-19-2020 0 0 0
2 01-21-2021 0 0 1

CodePudding user response:

This code delivered the desired solution using the tidyverse. Note that I had to force date into a date format; by default the mm-dd-YYYY format gets read as character, which causes min(date) and max(date) to read different values as min and max.

x<-'
ID  date    practice
1   02-27-2020  1
1   04-21-2021  0
1   06-24-2022  0
2   03-21-2019  0
2   09-19-2020  0
2   01-21-2021  0'

df1 <- read.table(textConnection(x), header = TRUE)
library(tidyverse)

df1$date <- as.Date(df1$date, format = "%m-%d-%Y")

desired_result <- df1 %>%
  group_by(ID) %>%
  mutate(
    lowest = ifelse(date == min(date[practice == 0]), 1, 0),
    highest = ifelse(date == max(date[practice == 0]), 1, 0)
  )

desired_result
# A tibble: 6 × 5
# Groups:   ID [2]
     ID date       practice lowest highest
  <int> <date>        <int>  <dbl>   <dbl>
1     1 2020-02-27        1      0       0
2     1 2021-04-21        0      1       0
3     1 2022-06-24        0      0       1
4     2 2019-03-21        0      1       0
5     2 2020-09-19        0      0       0
6     2 2021-01-21        0      0       1

CodePudding user response:

Here is a base R solution with ave.

x<-'
ID  date    practice
1   02-27-2020  1
1   04-21-2021  0
1   06-24-2022  0
2   03-21-2019  0
2   09-19-2020  0
2   01-21-2021  0'

df1 <- read.table(textConnection(x), header = TRUE)
df1$date <- as.Date(df1$date, "%m-%d-%Y")

y1 <- with(df1, ave(as.integer(date), ID, practice, FUN = \(x) {
  if(length(x))
    min(x) == x
  else NULL
}))
y2 <- with(df1, ave(as.integer(date), ID, practice, FUN = \(x) {
  if(length(x))
    max(x) == x
  else NULL
}))

df1$lowest <- as.integer(y1 & (df1$practice != 1))
df1$highest <- as.integer(y2 & (df1$practice != 1))
df1
#>   ID       date practice lowest highest
#> 1  1 2020-02-27        1      0       0
#> 2  1 2021-04-21        0      1       0
#> 3  1 2022-06-24        0      0       1
#> 4  2 2019-03-21        0      1       0
#> 5  2 2020-09-19        0      0       0
#> 6  2 2021-01-21        0      0       1

Created on 2022-04-25 by the reprex package (v2.0.1)

CodePudding user response:

A data.table approach:

f <- function(x,p) list(1*(x==min(x[p!=1])), 1*(x==max(x[p!=1])))
setDT(df)[,date:=as.IDate(date, "%m-%d-%Y")][,c("lowest","highest"):=f(date,practice), by=ID][]

Output:

      ID       date practice lowest highest
   <int>     <IDat>    <int>  <num>   <num>
1:     1 2020-02-27        1      0       0
2:     1 2021-04-21        0      1       0
3:     1 2022-06-24        0      0       1
4:     2 2019-03-21        0      1       0
5:     2 2020-09-19        0      0       0
6:     2 2021-01-21        0      0       1

Input:

structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), date = c("02-27-2020", 
"04-21-2021", "06-24-2022", "03-21-2019", "09-19-2020", "01-21-2021"
), practice = c(1L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA, -6L
), class = "data.frame")
  • Related