I am not sure if I describe my question with the correct title but the idea is:
I would like to longest stretch of rows of data entries of each group after using group_by()
which is also sensitive to the current order of rows. In other words, there are a (or multiple) discontinuities within a group (e.g. after arrange()
by some other columns). I would like to get a new column (e.g. mutate()
) that labels the rows that are within the longest stretch of each group. below is an example:
data.frame(group = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 1, 1, 3, 1, 2, 2, 2),
order = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17))
In which, I would like to get a data frame like the following:
data.frame(group = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 1, 1, 3, 1, 2, 2, 2),
order = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17),
longest = c(T, T, T, F, F, T, T, T, T, T, F, F, F, F, T, T, T))
CodePudding user response:
in Base R:
df$longest <- with(rle(df$group),
rep(ave(lengths, values, FUN = max) == lengths,lengths))
df
group order longest
1 1 1 TRUE
2 1 2 TRUE
3 1 3 TRUE
4 2 4 FALSE
5 2 5 FALSE
6 3 6 TRUE
7 3 7 TRUE
8 3 8 TRUE
9 3 9 TRUE
10 3 10 TRUE
11 1 11 FALSE
12 1 12 FALSE
13 3 13 FALSE
14 1 14 FALSE
15 2 15 TRUE
16 2 16 TRUE
17 2 17 TRUE
Another Base R:
a <- rle(df$group)
a$values <- ave(a$lengths, a$values, FUN = max) == a$lengths
df$longest <- inverse.rle(a)
In data.table:
library(data.table)
setDT(df)[, N := .N, by = rleid(group)][, longest := N == max(N), by = group][]
group order N longest
1: 1 1 3 TRUE
2: 1 2 3 TRUE
3: 1 3 3 TRUE
4: 2 4 2 FALSE
5: 2 5 2 FALSE
6: 3 6 5 TRUE
7: 3 7 5 TRUE
8: 3 8 5 TRUE
9: 3 9 5 TRUE
10: 3 10 5 TRUE
11: 1 11 2 FALSE
12: 1 12 2 FALSE
13: 3 13 1 FALSE
14: 1 14 1 FALSE
15: 2 15 3 TRUE
16: 2 16 3 TRUE
17: 2 17 3 TRUE
CodePudding user response:
We could create a group for the consecutive values in the group
column. Then, get the number of rows for those groups, then we can group by group
and return TRUE
for the rows that have the greatest number of consecutive rows for each group.
library(tidyverse)
df %>%
group_by(group_weight = cumsum(c(1, diff(group) != 0))) %>%
mutate(longest = n()) %>%
group_by(group) %>%
mutate(longest = longest == max(longest)) %>%
ungroup %>%
select(-group_weight)
Output
group order longest
<dbl> <dbl> <lgl>
1 1 1 TRUE
2 1 2 TRUE
3 1 3 TRUE
4 2 4 FALSE
5 2 5 FALSE
6 3 6 TRUE
7 3 7 TRUE
8 3 8 TRUE
9 3 9 TRUE
10 3 10 TRUE
11 1 11 FALSE
12 1 12 FALSE
13 3 13 FALSE
14 1 14 FALSE
15 2 15 TRUE
16 2 16 TRUE
17 2 17 TRUE
If you have a tie among consecutive rows and only want to return the first grouping as T
, then you could do something like this:
df2 %>%
group_by(group_weight = cumsum(c(1, diff(group) != 0))) %>%
mutate(longest = n()) %>%
group_by(group) %>%
mutate(longest = longest==max(longest)) %>%
group_by(longest, .add = TRUE) %>%
mutate(x = min(group_weight)) %>%
ungroup(longest) %>%
mutate(longest = longest == TRUE & group_weight == x & !is.na(x)) %>%
ungroup %>%
dplyr::select(-c(group_weight, x))
Output
group order longest
<dbl> <dbl> <lgl>
1 1 1 TRUE
2 1 2 TRUE
3 1 3 TRUE
4 2 4 FALSE
5 2 5 FALSE
6 3 6 TRUE
7 3 7 TRUE
8 3 8 TRUE
9 3 9 TRUE
10 3 10 TRUE
11 1 11 FALSE
12 1 12 FALSE
13 3 13 FALSE
14 1 14 FALSE
15 2 15 TRUE
16 2 16 TRUE
17 2 17 TRUE
18 1 18 FALSE
19 1 19 FALSE
20 1 20 FALSE
Data
df2 <- structure(list(group = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 1, 1,
3, 1, 2, 2, 2, 1, 1, 1), order = c(1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)), class = "data.frame", row.names = c(NA,
-20L))