How to group by while keeping other columns?

Time:07-26

Let's say I have the following data frame:

df <- data.frame(id = c(1,1,1,2,2,2,3,3,3,3),
             col1 = c("a","a", "b", "c", "d", "e", "f", "g", "h", "g"),
             start_day = c(NA,1,15, NA, 4, 22, 5, 11, 14, 18),
             end_day = c(NA,2, 15, NA, 6, 22, 6, 12, 16, 21))

Output:

   id col1 start_day end_day
1   1    a        NA      NA
2   1    a         1       2
3   1    b        15      15
4   2    c        NA      NA
5   2    d         4       6
6   2    e        22      22
7   3    f         5       6
8   3    g        11      12
9   3    h        14      16
10  3    g        18      21

I want to create a data frame that, for each unique id, holds the minimum of the start_day column and the maximum of the end_day column. I also want to keep the other columns. One solution could be to use group_by:

library(dplyr)
df %>% group_by(id) %>% summarise(start_day = min(start_day, na.rm = TRUE),
                                  end_day = max(end_day, na.rm = TRUE))

Output:

     id start_day end_day
1     1         1      15
2     2         4      22
3     3         5      21

But I lose the other columns (in this example, col1). How can I keep them? The desired outcome would look as follows:

     id start_day end_day col1_start col1_end
1     1         1      15          a        b
2     2         4      22          d        e
3     3         5      21          f        g

Is there any way to get the data frame I need?

CodePudding user response:

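Pick the 'col1' values first, indexing with which.min(start_day) and which.max(end_day), and only then compute the new 'start_day' and 'end_day'. The order matters because summarise() evaluates its arguments in sequence: once 'start_day' has been overwritten with the summarised value, later expressions see that single value rather than the original column.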

library(dplyr)
df %>% 
  group_by(id) %>% 
  summarise(col1_start = col1[which.min(start_day)],
            col1_end = col1[which.max(end_day)],
            start_day = min(start_day, na.rm = TRUE),
            end_day = max(end_day, na.rm = TRUE))

-output

# A tibble: 3 × 5
     id col1_start col1_end start_day end_day
  <dbl> <chr>      <chr>        <dbl>   <dbl>
1     1 a          b                1      15
2     2 d          e                4      22
3     3 f          g                5      21
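
To see why the order inside summarise() matters, here is a minimal sketch that runs both orderings on the question's data. The names index_first and summarise_first are only illustrative and do not come from the answer; the sketch assumes dplyr is installed.

```r
library(dplyr)

# Same data as in the question
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3,3),
                 col1 = c("a","a","b","c","d","e","f","g","h","g"),
                 start_day = c(NA,1,15,NA,4,22,5,11,14,18),
                 end_day = c(NA,2,15,NA,6,22,6,12,16,21))

# Index first: which.min() still sees the full per-group start_day vector
# (which.min() skips NA values on its own)
index_first <- df %>%
  group_by(id) %>%
  summarise(col1_start = col1[which.min(start_day)],
            start_day = min(start_day, na.rm = TRUE))

# Summarise first: start_day is already a single value per group,
# so which.min() returns 1 and col1[1] is just the group's first row
summarise_first <- df %>%
  group_by(id) %>%
  summarise(start_day = min(start_day, na.rm = TRUE),
            col1_start = col1[which.min(start_day)])

index_first$col1_start      # "a" "d" "f"
summarise_first$col1_start  # "a" "c" "f"
```

For id 2 the second ordering returns "c", the col1 value of the row whose start_day is NA, rather than the intended "d". An alternative that should avoid the ordering issue is to write the summaries under new names (for example min_start instead of start_day), so the original columns are never overwritten.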