Parse a large vector of characters quickly with regex

I have a problem where running my current regular expression over very large strings seems to nuke the container the code runs in. I know of one straightforward replacement for my code that is slightly slower, but I feel like there should be faster implementations. I would prefer to stay in base R.

An example of my code looks like:

# create a small example
library(microbenchmark)
set.seed(1986)

SET1 <- c("A", "B", "C", "D", "=", "E")
SET2 <- 1:100
SET3 <- sample(x = 1:6,
               size = 1000,
               replace = TRUE)

CIGAR <- sapply(X = SET3,
                FUN = function(x) {
                  z1 <- sample(x = SET1,
                               size = x,
                               replace = TRUE)
                  z2 <- sample(x = SET2,
                               size = x,
                               replace = TRUE)
                  z3 <- paste0(z1, z2, collapse = "")
                },
                simplify = TRUE)

# the original implementation:
# this appears to ask for exorbitant amounts of memory when the original string is very large
# or at least the Docker container dies, and that appears to be the cause
f1 <- function() {
  CIGARType <- regmatches(x = CIGAR,
                          m = gregexpr(pattern = "[^0-9]",
                                       text = CIGAR))
  CIGARCount <- regmatches(x = CIGAR,
                           m = gregexpr(pattern = "[0-9]+",
                                        text = CIGAR))
  CIGARCount <- lapply(X = CIGARCount,
                       FUN = function(x) {
                         as.integer(x)
                       })
  
  return(list(CIGARCount,
              CIGARType))
}

# the current alternative:
f2 <- function() {
  CIGARType <- CIGARCount <- vector(mode = "list",
                                    length = length(CIGAR))
  for (m1 in seq_along(CIGAR)) {
    CIGARType[m1] <- regmatches(x = CIGAR[m1],
                                m = gregexpr(pattern = "[^0-9]",
                                             text = CIGAR[m1]))
    CIGARCount[m1] <- regmatches(x = CIGAR[m1],
                                 m = gregexpr(pattern = "[0-9]+",
                                              text = CIGAR[m1]))
    CIGARCount[[m1]] <- as.integer(CIGARCount[[m1]])
  }
  return(list(CIGARCount,
              CIGARType))
}

My alternative works, and while it is slower, the container doesn't explode:

> microbenchmark(f1() -> a, f2() -> b)
Unit: milliseconds
      expr      min       lq     mean   median       uq      max neval
 a <- f1() 13.84300 15.21332 20.20298 16.92589 19.15321 101.0486   100
 b <- f2() 53.29438 58.50947 67.20698 62.89677 70.24507 158.4985   100

> all(mapply(function(x, y) all(x == y), x = a[[1]], y = b[[1]]))
[1] TRUE
> all(mapply(function(x, y) all(x == y), x = a[[2]], y = b[[2]]))
[1] TRUE

Are there alternative quick ways to parse these strings?

CodePudding user response:

Reduce the number of times you run a regex; the regex calls are the most expensive component here. It is generally much faster to extract each non-digit character together with the digits that follow it in one step, then split the matches apart with substring. Try this:

f3 <- function() {
  # one regex pass: each match is a type character followed by its count, e.g. "A12"
  tmp <- regmatches(CIGAR, gregexpr("[^0-9][0-9]+", CIGAR))
  types <- lapply(tmp, substr, 1, 1)                               # first character
  counts <- lapply(tmp, function(z) as.integer(substring(z, 2)))   # remaining digits
  list(counts, types)
}
out1 <- f1()
out3 <- f3()
identical(out1, out3)
# [1] TRUE

res <- bench::mark(f1(), f2(), f3(), min_iterations = 500)
res
# # A tibble: 3 x 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result     memory                 time             gc                
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>     <list>                 <list>           <list>            
# 1 f1()        11.13ms     15ms      63.6    15.9MB     52.4   274   226      4.31s <list [2]> <Rprofmem [4,587 x 3]> <bench_tm [500]> <tibble [500 x 3]>
# 2 f2()        43.03ms   50.4ms      18.8   15.82MB     39.3   162   338       8.6s <list [2]> <Rprofmem [4,002 x 3]> <bench_tm [500]> <tibble [500 x 3]>
# 3 f3()         8.15ms   10.1ms      99.0    7.96MB     29.2   386   114       3.9s <list [2]> <Rprofmem [2,008 x 3]> <bench_tm [500]> <tibble [500 x 3]>

plot(res)

(autoplot of the benchmark results)

(Note the x-axis is log.)

The key column I look at is `itr/sec`, which says this runs about 55% faster (in this run) than f1() and over 5x faster than f2(). Also note from `mem_alloc` that f3 uses just over half the memory allocation of f1. (The actual performance ratios vary between runs, from 25-66% faster than f1 and 4.5-5.2x faster than f2; I'm showing just one run. Every time I've done this benchmark, though, the relative performance is about the same, and f3 never comes out less than about 26% faster than f1.)

(FYI, I hope that f1 taking no arguments and reaching out of its own scope for CIGAR is just contrived for the benchmark and/or this question. If you're writing functions that work on data, pass the data as an argument to the function; never breach scope like this if at all avoidable, since it breaks reproducibility and can make troubleshooting frustratingly difficult.)
