My data.table
looks like the following (see bottom of post for copy/paste data). Both the id
and category
variables are grouping variables.
id category
1: 1 B100
2: 1 B100
3: 1 D300
4: 1 D300
5: 1 B100
6: 2 B100
7: 2 F500
8: 2 F500
9: 2 E600
10: 2 E600
11: 3 T400
12: 3 B100
13: 3 T400
14: 3 T400
Assume that the data is ordered correctly as given. Within each id
group, I want to create a variable that indicates the group of each consecutive run of category
(eg. see here).
For example, because "B100"
has two consecutive runs within id == 1
(rows 1:2 and row 5), the new variable should take the value 1
on rows 1:2, and 2
on row 5, since row 5 was the 'second time' that category == "B100"
occured inside id == 1
.
For the entire data.table
, my desired output is:
id category group
1: 1 B100 1 # The first run of
2: 1 B100 1 # B100 in id 1, length 2
3: 1 D300 1
4: 1 D300 1
5: 1 B100 2 # second run of "B100" within id 1, length 1
6: 2 B100 1
7: 2 F500 1
8: 2 F500 1
9: 2 E600 1
10: 2 E600 1 # no repeated category runs in id 2, so all 1
11: 3 T400 1
12: 3 B100 1
13: 3 T400 2 # The second run of
14: 3 T400 2 # "T400" within id 3, length 2
One way to approach is problem is to use data.table::rleid()
twice (let data be DT
):
library(data.table)
DT[, group := rleid(category), by = id]
DT <- split(DT, by = "id")
DT <- lapply(DT,
\(x) x[, group := rleid(group), by = category])
DT <- rbindlist(DT)
Question: Is there a way to do this that avoids splitting up by id
in the second step?
Data copy/paste
As data.frame
s for more general use.
DT <- data.frame(id = c(rep(1,5), rep(2,5), rep(3,3)),
category = c("B100","B100","D300","D300","B100",
"B100","F500","F500","E600","E600",
"T400","B100","T400","T400"))
Output <- data.frame(id = c(rep(1,5), rep(2,5), rep(3,3)),
category = c("B100","B100","D300","D300","B100",
"B100","F500","F500","E600","E600",
"T400","B100","T400","T400"),
group = c(1,1,1,1,2,1,1,1,1,1,1,1,2,2))
CodePudding user response:
Here's a way using two grouped operations, no splitting:
Output <- data.frame(id = c(rep(1,5), rep(2,5), rep(3,3)),
category = c("B100","B100","D300","D300","B100",
"B100","F500","F500","E600","E600",
"T400","B100","T400"),
group = c(1,1,1,1,2,1,1,1,1,1,1,1,2))
setDT(Output)
Output[, temp := rleid(category), by = .(id)][, result := as.integer(factor(temp)), by = .(id, category)]
Output
# id category group temp result
# 1: 1 B100 1 1 1
# 2: 1 B100 1 1 1
# 3: 1 D300 1 2 1
# 4: 1 D300 1 2 1
# 5: 1 B100 2 3 2
# 6: 2 B100 1 1 1
# 7: 2 F500 1 2 1
# 8: 2 F500 1 2 1
# 9: 2 E600 1 3 1
# 10: 2 E600 1 3 1
# 11: 3 T400 1 1 1
# 12: 3 B100 1 2 1
# 13: 3 T400 2 3 2
Output[, all(group == result)]
# [1] TRUE