For each ID return the earliest date from the start column and the latest date from the end column i-CodePudding

I have a dataset which has multiple start dates and end dates for each Id. I would like to take the earliest date from the "startDate" column and the latest date from the endDate column.


data = data.frame(ID=c(1,1,1,1,2,2,2),
                  startDate= c("2018-01-31", "2018-01-31", "2018-01-31", "2019-06-06",
                          "2002-06-07", "2002-06-07", "2002-09-12"),
                  endDate = c(NA,NA,NA,"2019-07-09",NA,NA, "2002-10-02"))

This is the output I was hoping to get:

data = data.frame(ID=c(1,2),
                  startDate= c("2018-01-31","2002-06-07"),
                  endDate = c("2019-07-09","2002-10-02"))

After trying I have figured out how to do this through the following code, but would prefer something more efficient if at all possible. I am continuously needing to do this and i would rather not have to create two separate dataframes. Thank you guys for your help!

data_start <- data %>%
          group_by(ID) %>%
          arrange(startDate) %>%
             slice(1L)

data_end <- data %>%
  group_by(ID) %>%
  arrange(desc(endDate)) %>%
  slice(1L)

data <- left_join(data_start[,c(1,2)], data_end[,c(1,3)], by="ID")

CodePudding user response：

Or with first and last:

library(dplyr)
data %>% 
  group_by(ID) %>%
  summarise(
    startDate = first(startDate),
    endDate = last(endDate)
  )
# A tibble: 2 x 3
     ID startDate  endDate   
* <dbl> <chr>      <chr>     
1     1 2018-01-31 2019-07-09
2     2 2002-06-07 2002-10-02

CodePudding user response：

You can use min and max, working the variables as dates

 data %>% group_by(ID) %>% 
      summarise(startDate = min(as.Date(startDate),na.rm = T),
                endDate = max(as.Date(endDate),na.rm = T))