I'm trying to use the grouping function within dplyr in r. I'm working with the following data
serial_number | entity | start | end | length |
---|---|---|---|---|
1 | banana | 1 | 6 | 6 |
2 | apple | 4 | 8 | 5 |
3 | pear | 7 | 10 | 4 |
3 | pear juice | 7 | 16 | 10 |
3 | juice | 12 | 16 | 5 |
3 | melon | 1 | 5 | 5 |
4 | guava | 43 | 47 | 5 |
5 | tomato | 1 | 6 | 6 |
5 | juice | 8 | 12 | 5 |
5 | tomato juice | 1 | 12 | 12 |
6 | grapes | 9 | 14 | 6 |
For each serial number I am trying to extract the longest entity which starts or ends at the same point, I am expecting the following result:
serial_number | entity | start | end | length |
---|---|---|---|---|
1 | banana | 1 | 6 | 6 |
2 | apple | 4 | 8 | 5 |
3 | pear juice | 7 | 16 | 10 |
3 | melon | 1 | 5 | 5 |
4 | guava | 43 | 47 | 5 |
5 | tomato juice | 1 | 12 | 12 |
6 | grapes | 9 | 14 | 6 |
I am trying to group the data in dplyr and have tried the following to group by serial number and then start (will also need to do similar for end locations).
example_data <- data.frame(serial_number=c(1,2,3,3,3,3,4,5,5,5,6),
entity=c("banana","apple","pear","pear juice","juice","melon","guava","tomato","juice","tomato juice","grapes"),
start=c(1,4,7,7,12,1,43,1,8,1,9),
end=c(6,8,10,16,16,5,47,6,12,12,14),
length=c(6,5,4,10,5,5,5,6,5,12,6)
)
example_data <- dplyr::group_by(example_data, serial_number)
example_data <- dplyr::group_by(example_data, start)
results_test <- dplyr::filter(example_data, length==max(length))
However this yields the following, I have tried other things and I think I have a fundamental misunderstanding of using the group_by function, particularly recursively (i.e. a group and then a further sub-group)
serial_number | entity | start | end | length |
---|---|---|---|---|
2 | apple | 4 | 8 | 5 |
3 | pear juice | 7 | 16 | 10 |
3 | juice | 12 | 16 | 5 |
5 | guava | 43 | 47 | 5 |
5 | juice | 8 | 12 | 5 |
6 | tomato juice | 1 | 12 | 12 |
7 | grapes | 9 | 14 | 6 |
Any advice gratefully received.
CodePudding user response:
It is just that when we call the group_by
multiple times, it overrides the grouping with the last call unless we call .add = TRUE
. Here, we don't need to do multiple group_by, as a single group_by can take more than one variable
library(dplyr)
example_data %>%
group_by(serial_number, grp = cumsum(start == 1)) %>%
slice_max(length) %>%
group_by(serial_number) %>%
arrange(start, desc(end), .by_group = TRUE) %>%
filter(!duplicated(start)) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 7 × 5
serial_number entity start end length
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 banana 1 6 6
2 2 apple 4 8 5
3 3 melon 1 5 5
4 3 pear juice 7 16 10
5 4 guava 43 47 5
6 5 tomato juice 1 12 12
7 6 grapes 9 14 6