Dealing with long regex patterns in R

Time:06-12

I have to apply a long regex pattern to a long string. The regex pattern is built something like this:

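set.seed(1234)
myFun <- function(n = 5000) {
  # five random capital letters per string
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  # append a zero-padded four-digit number and one more letter
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}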
   
long_regex <- paste0(myFun(1000), collapse = "|")
long_regex <- paste0("(", long_regex, ")")

However, gsub can't deal with such long patterns:

text <- "HPPIZ9166O BHVOF0473O LCVDO3833Z"
gsub(long_regex, "marker \\1;", text)
Error in gsub(long_regex, "marker \\1;", text) : 
  assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', 
  line 634 

How do I overcome this issue? Thank you.

CodePudding user response:

If your regexes are valid as Perl regexes, the Perl-compatible regex engine seems to cope. The default (TRE) engine fails:

> gsub(long_regex, "marker \\1;", text)
Error in gsub(long_regex, "marker \\1;", text) : 
  assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634

but...

> gsub(long_regex, "marker \\1;", text, perl=TRUE)
[1] "HPPIZ9166O BHVOF0473O LCVDO3833Z"

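The text comes back unchanged there only because none of its three strings is among the alternatives in the pattern. If I pick out one of the strings that is in the regex, you can see that gsub works: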

> substr(long_regex,10000,10100)
[1] "|PZIFO9919X|VBICZ3063E|HZTGZ8881V|PUURO8525W|QLYMN6531U|KTUQZ7171V|GULUD6556Z|UMHSA7400F|DAYHH0017F|Q"
> text = "HZTGZ8881V nope "
> gsub(long_regex, "marker \\1;", text, perl=TRUE)
[1] "marker HZTGZ8881V; nope "
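
If the alternatives are all literal strings, as they are here, another option is to skip the regex entirely. The following is only a minimal sketch, not part of the original answer: `mark_tokens` and `keys` are hypothetical names, and it assumes the text is split on single spaces and that only whole tokens should be marked (the regex version would also match a key embedded inside a longer string).

```r
# Hypothetical helper: mark whole-word tokens that appear in a vector of
# literal keys, without building one huge alternation pattern.
mark_tokens <- function(text, keys) {
  # split on single spaces (assumption about the input format)
  tokens <- strsplit(text, " ", fixed = TRUE)[[1]]
  # exact, non-regex membership test
  hit <- tokens %in% keys
  tokens[hit] <- paste0("marker ", tokens[hit], ";")
  paste(tokens, collapse = " ")
}

# Illustrative keys taken from the substr() output above
keys <- c("PZIFO9919X", "HZTGZ8881V", "DAYHH0017F")
mark_tokens("HZTGZ8881V nope", keys)
# [1] "marker HZTGZ8881V; nope"
```

With the real data, `keys` would be the vector returned by `myFun(1000)` before it is collapsed with `"|"`.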