Let's say I have the following data frame:
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                 col1 = c("a", "a", "b", "c", "d", "e", "f", "g", "h", "g"),
                 start_day = c(NA, 1, 15, NA, 4, 22, 5, 11, 14, 18),
                 end_day = c(NA, 2, 15, NA, 6, 22, 6, 12, 16, 21))
Output:
   id col1 start_day end_day
1   1    a        NA      NA
2   1    a         1       2
3   1    b        15      15
4   2    c        NA      NA
5   2    d         4       6
6   2    e        22      22
7   3    f         5       6
8   3    g        11      12
9   3    h        14      16
10  3    g        18      21
I want to create a data frame that, for each unique id, contains the minimum of the start_day column and the maximum of the end_day column. I also want to keep the other columns. One solution is to use group_by:
library(dplyr)
df %>%
  group_by(id) %>%
  summarise(start_day = min(start_day, na.rm = TRUE),
            end_day = max(end_day, na.rm = TRUE))
Output:
  id start_day end_day
1  1         1      15
2  2         4      22
3  3         5      21
But then I lose the other columns (in this example, col1). How can I keep them? The desired outcome would look as follows:
  id start_day end_day col1_start col1_end
1  1         1      15          a        b
2  2         4      22          d        e
3  3         5      21          f        g
Is there any way to get the data frame I need?
CodePudding user response:
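Look up the 'col1' values first, and only then summarise 'start_day' and 'end_day'. summarise() evaluates its arguments in order, so once 'start_day' has been overwritten with the summarised value, a later which.min(start_day) would see only that single value rather than the original column.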
library(dplyr)
df %>%
  group_by(id) %>%
  summarise(col1_start = col1[which.min(start_day)],
            col1_end = col1[which.max(end_day)],
            start_day = min(start_day, na.rm = TRUE),
            end_day = max(end_day, na.rm = TRUE))
-output
# A tibble: 3 × 5
     id col1_start col1_end start_day end_day
  <dbl> <chr>      <chr>        <dbl>   <dbl>
1     1 a          b                1      15
2     2 d          e                4      22
3     3 f          g                5      21
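To make the ordering point concrete, here is a minimal sketch that rebuilds the question's df, runs the summary in both orders, and then uses select() to put the columns in the order the question asked for. The names `result` and `wrong` are made up for illustration and are not part of the original answer.

```r
library(dplyr)

# Same data as in the question.
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                 col1 = c("a", "a", "b", "c", "d", "e", "f", "g", "h", "g"),
                 start_day = c(NA, 1, 15, NA, 4, 22, 5, 11, 14, 18),
                 end_day = c(NA, 2, 15, NA, 6, 22, 6, 12, 16, 21))

# Intended order: look up col1 while start_day/end_day are still the
# full per-group vectors, then reduce them to a single value each.
result <- df %>%
  group_by(id) %>%
  summarise(col1_start = col1[which.min(start_day)],
            col1_end = col1[which.max(end_day)],
            start_day = min(start_day, na.rm = TRUE),
            end_day = max(end_day, na.rm = TRUE)) %>%
  select(id, start_day, end_day, col1_start, col1_end)

# Reversed order, shown only to illustrate the pitfall: start_day and
# end_day are already length 1 when which.min()/which.max() run, so the
# index is always 1 and col1[1] (the group's first row) is returned.
wrong <- df %>%
  group_by(id) %>%
  summarise(start_day = min(start_day, na.rm = TRUE),
            end_day = max(end_day, na.rm = TRUE),
            col1_start = col1[which.min(start_day)],
            col1_end = col1[which.max(end_day)])
```

Note that which.min() and which.max() skip NA values and return the first matching position, so if several rows in a group tie for the minimum or maximum, the col1 value from the earliest of those rows is kept.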