How can I overcome problems with Dplyr grouping functions?-CodePudding

I'm trying to use the grouping function within dplyr in r. I'm working with the following data

serial_number	entity	start	end	length
1	banana	1	6	6
2	apple	4	8	5
3	pear	7	10	4
3	pear juice	7	16	10
3	juice	12	16	5
3	melon	1	5	5
4	guava	43	47	5
5	tomato	1	6	6
5	juice	8	12	5
5	tomato juice	1	12	12
6	grapes	9	14	6

For each serial number I am trying to extract the longest entity which starts or ends at the same point, I am expecting the following result:

serial_number	entity	start	end	length
1	banana	1	6	6
2	apple	4	8	5
3	pear juice	7	16	10
3	melon	1	5	5
4	guava	43	47	5
5	tomato juice	1	12	12
6	grapes	9	14	6

I am trying to group the data in dplyr and have tried the following to group by serial number and then start (will also need to do similar for end locations).

example_data <- data.frame(serial_number=c(1,2,3,3,3,3,4,5,5,5,6),
                           entity=c("banana","apple","pear","pear juice","juice","melon","guava","tomato","juice","tomato juice","grapes"),
                           start=c(1,4,7,7,12,1,43,1,8,1,9),
                           end=c(6,8,10,16,16,5,47,6,12,12,14),
                           length=c(6,5,4,10,5,5,5,6,5,12,6)
                           )
 

example_data <- dplyr::group_by(example_data, serial_number)
example_data <- dplyr::group_by(example_data, start)
results_test <- dplyr::filter(example_data, length==max(length))

However this yields the following, I have tried other things and I think I have a fundamental misunderstanding of using the group_by function, particularly recursively (i.e. a group and then a further sub-group)

serial_number	entity	start	end	length
2	apple	4	8	5
3	pear juice	7	16	10
3	juice	12	16	5
5	guava	43	47	5
5	juice	8	12	5
6	tomato juice	1	12	12
7	grapes	9	14	6

Any advice gratefully received.

CodePudding user response：

It is just that when we call the group_by multiple times, it overrides the grouping with the last call unless we call .add = TRUE. Here, we don't need to do multiple group_by, as a single group_by can take more than one variable

library(dplyr)
 example_data %>%
  group_by(serial_number, grp = cumsum(start == 1)) %>% 
  slice_max(length) %>% 
  group_by(serial_number) %>% 
  arrange(start, desc(end), .by_group = TRUE) %>%
  filter(!duplicated(start)) %>%
  ungroup %>%
  select(-grp)

-output

 # A tibble: 7 × 5
  serial_number entity       start   end length
          <dbl> <chr>        <dbl> <dbl>  <dbl>
1             1 banana           1     6      6
2             2 apple            4     8      5
3             3 melon            1     5      5
4             3 pear juice       7    16     10
5             4 guava           43    47      5
6             5 tomato juice     1    12     12
7             6 grapes           9    14      6