Most string pattern manipulation works within a single string. Is there a way to detect a pattern across strings, i.e. in a data frame sense, detecting a string pattern across rows? For example, a simple sequence (pattern) that starts with an "a" and is followed by one or more "b", like "ab", "abb", "abbb", and so on. Each letter is a row-level observation.
My own way is rather complicated. First, paste all event rows into a single string. Then use str_locate_all to get the indices of the matched pattern (the start and end of any "ab...b"). Convert the indices into a matrix, fill in the indices of the targeted characters with a last-observation-carried-forward method, and once I have the indices of all targeted characters, tag the corresponding rows in the original dataset.
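Roughly, what I have in mind is something like the sketch below (a sketch only; it uses stringr::str_locate_all, and a simple loop stands in for the fill-forward step):
library(stringr)
event <- c("other", "b", "a", "b", "a", "a", "b", "b", "other")
## recode each row to a single character: "a", "b", or "c" for anything else
x <- letters[match(event, c("a", "b"), nomatch = 3L)]
s <- paste0(x, collapse = "")
## one row per matched "ab...b" segment, with columns `start` and `end`
loc <- str_locate_all(s, "ab+")[[1]]
## tag the rows covered by each segment with the segment number
pattern_tag <- integer(length(event))
for (i in seq_len(nrow(loc))) {
  pattern_tag[loc[i, "start"]:loc[i, "end"]] <- i
}
pattern_tag
#[1] 0 0 1 1 0 2 2 2 0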
Is there an easier way or package to tag all rows that show such a pattern across rows?
A small example, where two different sequences are detected and numbered:
events | pattern_tag |
---|---|
other | 0 |
b | 0 |
a | 1 # start of sequence, an "a"... |
b | 1 # ...followed by a "b" |
a | 0 |
a | 2 # start of sequence, an "a"... |
b | 2 # ...followed by a "b"... |
b | 2 # ...and a second "b" |
other | 0 |
event <- c("other", "b", "a", "b", "a", "a", "b", "b", "other")
CodePudding user response:
My own way is rather complicated. First paste all event rows into a single string. Then stringr::str_locate_all the indices of the matched pattern (the start and end of any "ab...b"). Convert the indices into a matrix, fill in the indices of the targeted characters using a last-observation-carried-forward method. Once we have the indices of all targeted characters, I can tag them in the original dataset.
This is the correct algorithm, although it can be made more efficient. So whether we code this up ourselves or look for an existing function (is there one??), it will basically do the same thing.
Apart from using REGEX, we can also use run-length encoding (RLE) to locate all "ab...b" segments. The following function implements both the RLE and the REGEX method, using the argument use.regex as a switch.
## `event` is a vector of strings
## `a` is the string marking the start of a segment of interest
## `b` is the string marking the end of that segment
## so, abstractly, we are interested in "ab...b" segments
## where the `a` string is followed by one or more `b` strings
Event2ID <- function (event, a = "a", b = "b", use.regex = FALSE) {
  ## a simplified character vector with unique values "a", "b" and "c"
  x <- letters[match(event, c(a, b), nomatch = 3L)]
  ## REGEX solution or RLE solution
  if (use.regex) {
    ## collapse everything to a single string
    string <- paste0(x, collapse = "")
    ## in the following, I am using `base::gregexpr`
    ## you may use `stringr::str_locate_all` for similar results
    ## the starting location of "ab...b" segments
    a.loc <- gregexpr("ab+", string)[[1]]
    ## length of each "ab...b" segment
    len.segments <- attr(a.loc, "match.length")
    ## number of "ab...b" segments
    num.segments <- length(a.loc)
    ## extra safety:
    ## in case no "ab...b" segment exists
    if (len.segments[1] == -1) {
      a.loc <- len.segments <- integer(0)
      num.segments <- 0
    }
  } else {
    ## run length encoding
    RLE <- rle(x)
    l <- RLE$lengths
    v <- RLE$values
    ## where do "ab" segments start in (l, v)?
    pos <- which(v[-length(v)] == "a" & v[-1] == "b")
    ## where do the runs of "a" end in `x`?
    ## this is also the starting location of "ab...b" segments
    a.loc <- cumsum(l)[pos]
    ## the number of "b" after "a"
    num.b <- l[pos + 1]
    ## length of each "ab...b" segment
    len.segments <- num.b + 1
    ## number of "ab...b" segments
    num.segments <- length(len.segments)
  }
  ## the position index of "ab...b" segments in `x`
  ## so `x[ind]` extracts all "ab...b" segments
  ind <- sequence(nvec = len.segments, from = a.loc)
  ## create ID sequences for those "ab...b" segments
  ID <- rep.int(seq_len(num.segments), times = len.segments)
  ## create the resulting vector
  code <- numeric(length(x))
  code[ind] <- ID
  code
}
Testing (thanks to the OP for providing examples of different kinds):
event1 <- c("other", "b", "a", "b", "a", "a", "b", "b", "other")
Event2ID(event1, use.regex = FALSE)
#[1] 0 0 1 1 0 2 2 2 0
Event2ID(event1, use.regex = TRUE)
#[1] 0 0 1 1 0 2 2 2 0
event2 <- c("a","other","b","a","a")
Event2ID(event2, use.regex = FALSE)
#[1] 0 0 0 0 0
Event2ID(event2, use.regex = TRUE)
#[1] 0 0 0 0 0
event3 <- c("a", "other", "other", "other", "a", "b", "b", "b", "b",
"b", "other", "a")
Event2ID(event3, use.regex = FALSE)
#[1] 0 0 0 0 1 1 1 1 1 1 0 0
Event2ID(event3, use.regex = TRUE)
#[1] 0 0 0 0 1 1 1 1 1 1 0 0
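To make the RLE branch more concrete, here is a small walkthrough (not part of the original answer) of its key intermediate objects for event1; note that sequence() with the nvec and from arguments requires R >= 4.0.0:
x <- letters[match(event1, c("a", "b"), nomatch = 3L)]
x
#[1] "c" "b" "a" "b" "a" "a" "b" "b" "c"
RLE <- rle(x)
RLE$lengths
#[1] 1 1 1 1 2 2 1
RLE$values
#[1] "c" "b" "a" "b" "a" "b" "c"
## an "a" run immediately followed by a "b" run occurs at runs 3 and 5
pos <- which(RLE$values[-length(RLE$values)] == "a" & RLE$values[-1] == "b")
pos
#[1] 3 5
## the two "ab...b" segments start at positions 3 and 6 of `x`, with lengths 2 and 3
a.loc <- cumsum(RLE$lengths)[pos]
a.loc
#[1] 3 6
len.segments <- RLE$lengths[pos + 1] + 1
len.segments
#[1] 2 3
## `sequence()` expands these into the row positions to tag
sequence(nvec = len.segments, from = a.loc)
#[1] 3 4 6 7 8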
Benchmark
I will benchmark my RLE and REGEX solutions against onyambu's answer.
## put onyambu's code into a function
onyambu <- function (event, a = "a", b = "b") {
  tibble(event) %>%
    group_by(grp = cumsum(event == a)) %>%
    summarise(data.frame(unclass(rle(event))), .groups = 'keep') %>%
    mutate(v = values == a & lead(values, default = values[1]) == b,
           v = map2(cumsum(v) * (row_number() <= 2) * grp, lengths, rep)) %>%
    ungroup() %>%
    unnest(v) %>%
    select(values, v) %>%
    mutate(v = as.integer(factor(v)) - 1)
}
library(dplyr); library(purrr); library(tidyr) ## for onyambu's code
library(microbenchmark)
## a long vector
event <- rep(event1, 100)
microbenchmark("RLE" = Event2ID(event, use.regex = FALSE),
               "REGEX" = Event2ID(event, use.regex = TRUE),
               "onyambu" = onyambu(event))
My function is much faster. In addition, the RLE method is slightly faster than the REGEX method.
Unit: microseconds
expr min lq mean median uq max
RLE 214.964 239.0865 264.7935 265.539 281.0670 445.178
REGEX 302.116 315.8950 381.8376 338.728 354.8745 4846.352
onyambu 279255.197 285435.2685 295915.4871 290845.777 306348.5620 471773.368
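Before trusting the timings, one may also want to check that the three variants agree on the long test vector, for example:
## quick sanity check (not shown in the original benchmark)
stopifnot(identical(Event2ID(event, use.regex = FALSE), Event2ID(event, use.regex = TRUE)))
stopifnot(all.equal(Event2ID(event), onyambu(event)$v))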
CodePudding user response:
library(dplyr); library(purrr); library(tidyr)

## `event` is the vector defined in the question
tibble(event) %>%
  ## a new group starts at every "a"
  group_by(grp = cumsum(event == 'a')) %>%
  ## run-length encode the events within each group
  summarise(data.frame(unclass(rle(event))), .groups = 'keep') %>%
  ## flag groups whose "a" run is immediately followed by a "b" run,
  ## give the first two runs (the "a" and its "b"s) of such groups a
  ## group-unique id, and repeat each run's id `lengths` times so there
  ## is one value per original row
  mutate(v = values == 'a' & lead(values, default = values[1]) == 'b',
         v = map2(cumsum(v) * (row_number() <= 2) * grp, lengths, rep)) %>%
  ungroup() %>%
  unnest(v) %>%
  select(values, v) %>%
  ## convert the ids to consecutive integers: 0 for untagged rows, 1, 2, ... for segments
  mutate(v = as.integer(factor(v)) - 1)
# A tibble: 9 x 2
values v
<chr> <dbl>
1 other 0
2 b 0
3 a 1
4 b 1
5 a 0
6 a 2
7 b 2
8 b 2
9 other 0