Home > OS >  Create grouping variable from string sequences
Create grouping variable from string sequences

Time:07-13

The string pattern manipulation is mostly based on within string patterns. Is there a way to detect pattern across strings; in a data frame sense, detecting string patterns across rows. For example, a simple sequence, or pattern, that starts with a and is followed by one or more b, like ab, abb, abbb, and so on. Each letter is a row-level observation.

My own way is rather complicated. First paste all event rows into a single string. Then str_locate_all the index of patterned string (the start and ending of any ab...b). Convert the index into matrix, fill in all the index of targeted character using last observation forward method. Once we have the index of all targeted string, I can tag them in the original dataset.

Is there an easier way or package to tag all rows that show such a pattern across rows?

A small example, where two different sequences are detected and numbered:

events pattern_tag
other 0
b 0
a 1 # start of sequence, an "a"...
b 1 # ...followed by a "b"
a 0
a 2 # start of sequence, an "a"...
b 2 # ...followed by a "b"...
b 2 # ...and a second "b"
other 0
event <- c("other", "b", "a", "b", "a", "a", "b", "b", "other")

CodePudding user response:

My own way is rather complicated. First paste all event rows into a single string. Then stringr::str_locate_all the index of patterned string (the start and ending of any "ab...b"). Convert the index into matrix, fill in all the index of targeted character using last observation forward method. Once we have the index of all targeted string, I can tag them in the original dataset.

This is the correct algorithm, although it can be made more efficient. So whether we code this up ourselves, or seek existing function (is there any??), they will basically do the same thing.

Apart from using REGEX, we can also use run length encoding (RLE) to locate all "ab...b" segments. The following function implements both RLE and REGEX methods (using argument use.regex as a switch).

## `event` is a vector of strings
## `a` is the string marking the start of a segment of interest
## `b` is the string marking the end of that segment
## so, abstractly, we are interested in "ab...b" segments
## where `a` string is followed by several `b` strings
Event2ID <- function (event, a = "a", b = "b", use.regex = FALSE) {
  ## a simplified character vector with unique values "a", "b" and "c"
  x <- letters[match(event, c(a, b), nomatch = 3L)]
  ## REGEX solution or RLE solution
  if (use.regex) {
    ## collapse everything to a single string
    string <- paste0(x, collapse = "")
    ## in the following, I am using `base::gregexpr`
    ## you may use `stringr::str_locate_all` for similar results
    ## the starting location of "ab...b" segments
    a.loc <- gregexpr("ab ", string)[[1]]
    ## length of each "ab...b" segment
    len.segments <- attr(a.loc, "match.length")
    ## number of "ab...b" segments
    num.segments <- length(a.loc)
    ## extra safety:
    ## in case no "ab...b" segement exists
    if (len.segments[1] == -1) {
      a.loc <- len.segments <- integer(0)
      num.segments <- 0
    }
  } else {
    ## run length encoding
    RLE <- rle(x)
    l <- RLE$lengths
    v <- RLE$values
    ## where do "ab" segments start in (l, v)?
    pos <- which(v[-length(v)] == a & v[-1] == b)
    ## where do the runs of "a" ends in `x`?
    ## this is also the starting location of "ab...b" segments
    a.loc <- cumsum(l)[pos]
    ## the number of "b" after "a"
    num.b <- l[pos   1]
    ## length of each "ab...b" segment
    len.segments <- num.b   1
    ## number of "ab...b" segments
    num.segments <- length(len.segments)
  }
  ## the position index of "ab...b" segments in `x`
  ## so `x[ind]` extracts all "ab...b" segments
  ind <- sequence(nvec = len.segments, from = a.loc)
  ## create ID sequences for those "ab...b" segments
  ID <- rep.int(seq_len(num.segments), times = len.segments)
  ## create the resulting vector
  code <- numeric(length(x))
  code[ind] <- ID
  code
}

Testing (thanks OP for providing examples of different kinds):

event1 <- c("other", "b", "a", "b", "a", "a", "b", "b", "other")
Event2ID(event1, use.regex = FALSE)
#[1] 0 0 1 1 0 2 2 2 0
Event2ID(event1, use.regex = TRUE)
#[1] 0 0 1 1 0 2 2 2 0

event2 <- c("a","other","b","a","a")
Event2ID(event2, use.regex = FALSE)
#[1] 0 0 0 0 0
Event2ID(event2, use.regex = TRUE)
#[1] 0 0 0 0 0

event3 <- c("a", "other", "other", "other", "a", "b", "b", "b", "b",
            "b", "other", "a")
Event2ID(event3, use.regex = FALSE)
#[1] 0 0 0 0 1 1 1 1 1 1 0 0
Event2ID(event3, use.regex = TRUE)
#[1] 0 0 0 0 1 1 1 1 1 1 0 0

Benchmark

I will benchmark my RLE and REGEX solutions, with onyambu's answer.

## put onyambu's code into a function
onyambu <- function (event, a = "a", b = "b") {
  tibble(event) %>%
  group_by(grp = cumsum(event==a)) %>%
  summarise(data.frame(unclass(rle(event))), .groups = 'keep')%>%
  mutate(v=values == a & lead(values, default = values[1]) == b,
         v = map2(cumsum(v) * (row_number()<=2) * grp, lengths, rep))%>%
  ungroup()  %>%
  unnest(v) %>%
  select(values, v) %>%
  mutate(v = as.integer(factor(v)) - 1)
}
library(dplyr); library(purrr); library(tidyr)  ## for onyambu's code
library(microbenchmark)

## a long vector
event <- rep(event1, 100)

microbenchmark("RLE" = Event2ID(event, use.regex = FALSE),
               "REGEX" = Event2ID(event, use.regex = TRUE),
               "onyambu" = onyambu(event))

My function is way much faster. In addition, RLE is slightly faster than REGEX.

Unit: microseconds
    expr        min          lq        mean     median          uq        max
     RLE    214.964    239.0865    264.7935    265.539    281.0670    445.178
   REGEX    302.116    315.8950    381.8376    338.728    354.8745   4846.352
 onyambu 279255.197 285435.2685 295915.4871 290845.777 306348.5620 471773.368

CodePudding user response:

tibble(event) %>%
  group_by(grp = cumsum(event=='a')) %>%
  summarise(data.frame(unclass(rle(event))), .groups = 'keep')%>%
  mutate(v=values == 'a' & lead(values, default = values[1]) == 'b',
         v = map2(cumsum(v) * (row_number()<=2) * grp, lengths, rep))%>%
  ungroup()  %>%
  unnest(v) %>%
  select(values, v) %>%
  mutate(v = as.integer(factor(v)) - 1)

# A tibble: 9 x 2
  values     v
  <chr>  <dbl>
1 other      0
2 b          0
3 a          1
4 b          1
5 a          0
6 a          2
7 b          2
8 b          2
9 other      0
  • Related