How to filter a data frame to only min and max values of different columns in R?

Let's say I have the following data frame:

df <- data.frame(id = c(1,1,1,2,2,2,3,3,3,3),
                 col1 = c("a","a", "b", "c", "d", "e", "f", "g", "h", "g"),
                 start_day = c(NA,1,15, NA, 4, 22, 5, 11, 14, 18),
                 end_day = c(NA,2, 15, NA, 6, 22, 6, 12, 16, 21))

I want to create a data frame with the columns id, start_day, and end_day such that, for each unique id, I keep only the minimum of the start_day column and the maximum of the end_day column, ignoring the rows with NA. The final data frame should look as follows:

  id start_day end_day
1  1         1      15
2  2         4      22
3  3         5      21

To get this new data frame I wrote the following code:

df <- df[!(is.na(df$start_day)), ]

dt <- data.frame(matrix(ncol =3 , nrow = length(unique(df$id))))
colnames(dt) <- c("id", "start_day", "end_day")
dt$id <- unique(df$id)


st_day <- vector()
en_day <- vector()
for (elm in dt$id) {
  d <- df[df$id == elm, ]
  minimum <- min(d$start_day)
  maximum <- max(d$end_day)
  st_day <- c(st_day, minimum)
  en_day <- c(en_day, maximum)
}

dt$start_day <- st_day
dt$end_day <- en_day
df <- dt

My code is creating what I am looking for, but I am not happy with it. I would love to learn a better and cleaner way to do the same thing. Any idea is very much appreciated.

CodePudding user response:

You can try data.table, as shown below:

> library(data.table)

> na.omit(setDT(df))[, .(start_day = min(start_day), end_day = max(end_day)), id]
   id start_day end_day
1:  1         1      15
2:  2         4      22
3:  3         5      21
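
One detail worth checking with this approach: na.omit() drops every row that has an NA in any column, not only in start_day or end_day. On the question's data that makes no difference, but it would if another column such as col1 had missing values. Here is a minimal base-R sketch of that behavior; the names `df2`, `dropped_any`, and `dropped_days`, and the NA placed in col1, are made up for illustration:

```r
# Data from the question
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3,3),
                 col1 = c("a","a", "b", "c", "d", "e", "f", "g", "h", "g"),
                 start_day = c(NA,1,15, NA, 4, 22, 5, 11, 14, 18),
                 end_day = c(NA,2, 15, NA, 6, 22, 6, 12, 16, 21))

# Hypothetical variation: an unrelated column also has a missing value
df2 <- df
df2$col1[2] <- NA

# na.omit() removes rows with an NA in ANY column (rows 1, 2 and 4 here)
dropped_any <- na.omit(df2)

# complete.cases() on selected columns only removes rows 1 and 4
dropped_days <- df2[complete.cases(df2[, c("start_day", "end_day")]), ]

nrow(dropped_any)
nrow(dropped_days)
```

Also note that setDT() converts df to a data.table by reference, so df itself is changed; as.data.table(df) returns a copy instead.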

CodePudding user response:

This should do it:

library(dplyr)

df %>% group_by(id) %>% summarise(start_day = min(start_day, na.rm = TRUE),
                                  end_day = max(end_day, na.rm = TRUE))

Output:

     id start_day end_day
  <dbl>     <dbl>   <dbl>
1     1         1      15
2     2         4      22
3     3         5      21
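
If installing a package is not an option, the same per-id summary can be sketched in base R. This is only one possible approach and not part of the answers above; the intermediate names `clean`, `mins`, `maxs`, and `result` are made up for illustration, and the data frame is repeated from the question so the snippet runs on its own:

```r
# Data from the question
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3,3),
                 col1 = c("a","a", "b", "c", "d", "e", "f", "g", "h", "g"),
                 start_day = c(NA,1,15, NA, 4, 22, 5, 11, 14, 18),
                 end_day = c(NA,2, 15, NA, 6, 22, 6, 12, 16, 21))

# Keep only rows where both day columns are present
clean <- df[complete.cases(df[, c("start_day", "end_day")]), ]

# Minimum start_day and maximum end_day per id
mins <- aggregate(start_day ~ id, data = clean, FUN = min)
maxs <- aggregate(end_day ~ id, data = clean, FUN = max)

# Combine the two summaries into one data frame, one row per id
result <- merge(mins, maxs, by = "id")
result
```

On the question's data this should give the same three rows as the outputs shown above, as a plain data.frame rather than a data.table or tibble.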