Home > other >  Group data.table by consecutive runs of two id variables, without using split()
Group data.table by consecutive runs of two id variables, without using split()

Time:09-03

My data.table looks like the following (see bottom of post for copy/paste data). Both the id and category variables are grouping variables.

    id category
 1:  1     B100
 2:  1     B100
 3:  1     D300
 4:  1     D300
 5:  1     B100
 6:  2     B100
 7:  2     F500
 8:  2     F500
 9:  2     E600
10:  2     E600
11:  3     T400
12:  3     B100
13:  3     T400
14:  3     T400

Assume that the data is ordered correctly as given. Within each id group, I want to create a variable that indicates the group of each consecutive run of category (eg. see here).

For example, because "B100" has two consecutive runs within id == 1 (rows 1:2 and row 5), the new variable should take the value 1 on rows 1:2, and 2 on row 5, since row 5 was the 'second time' that category == "B100" occured inside id == 1.

For the entire data.table, my desired output is:

    id category group
 1:  1     B100     1 # The first run of 
 2:  1     B100     1 # B100 in id 1, length 2
 3:  1     D300     1
 4:  1     D300     1
 5:  1     B100     2 # second run of "B100" within id 1, length 1
 6:  2     B100     1
 7:  2     F500     1
 8:  2     F500     1
 9:  2     E600     1
10:  2     E600     1 # no repeated category runs in id 2, so all 1
11:  3     T400     1
12:  3     B100     1
13:  3     T400     2 # The second run of 
14:  3     T400     2 # "T400" within id 3, length 2

One way to approach is problem is to use data.table::rleid() twice (let data be DT):

library(data.table)

DT[, group := rleid(category), by = id]

DT <- split(DT, by = "id")

DT <- lapply(DT,
       \(x) x[, group := rleid(group), by = category])

DT <- rbindlist(DT)

Question: Is there a way to do this that avoids splitting up by id in the second step?

Motivation for this question

Data copy/paste

As data.frames for more general use.

DT <- data.frame(id = c(rep(1,5), rep(2,5), rep(3,3)),
           category = c("B100","B100","D300","D300","B100",
                        "B100","F500","F500","E600","E600",
                        "T400","B100","T400","T400"))


Output <- data.frame(id = c(rep(1,5), rep(2,5), rep(3,3)),
           category = c("B100","B100","D300","D300","B100",
                        "B100","F500","F500","E600","E600",
                        "T400","B100","T400","T400"),
           group = c(1,1,1,1,2,1,1,1,1,1,1,1,2,2))

CodePudding user response:

Here's a way using two grouped operations, no splitting:

Output <- data.frame(id = c(rep(1,5), rep(2,5), rep(3,3)),
           category = c("B100","B100","D300","D300","B100",
                        "B100","F500","F500","E600","E600",
                        "T400","B100","T400"),
           group = c(1,1,1,1,2,1,1,1,1,1,1,1,2))

setDT(Output)

Output[, temp := rleid(category), by = .(id)][, result := as.integer(factor(temp)), by = .(id, category)]
Output
#     id category group temp result
#  1:  1     B100     1    1      1
#  2:  1     B100     1    1      1
#  3:  1     D300     1    2      1
#  4:  1     D300     1    2      1
#  5:  1     B100     2    3      2
#  6:  2     B100     1    1      1
#  7:  2     F500     1    2      1
#  8:  2     F500     1    2      1
#  9:  2     E600     1    3      1
# 10:  2     E600     1    3      1
# 11:  3     T400     1    1      1
# 12:  3     B100     1    2      1
# 13:  3     T400     2    3      2
Output[, all(group == result)]
# [1] TRUE
  • Related