Within a column, I'd like to gsub each row of string values and remove any value that matches a



I am working with a messy data file. I have a list of comments that I'd like to sort through to find the most common combinations of phrases. Example phrases would be "Did not qualify because of X and Y" and "Did not qualify because of Y and X". I am trying to remove stop words so that I can match X and Y as a common phrase. I was able to do this easily for common single words, but phrases are more difficult. Below is my code for context.

Create Datafile

dat1 <- dat %>% filter(Action != Exclude)

Remove problem characters

dat1$Comments <- stri_trans_general(dat1$Comments, "latin-ascii")
dat1$Comments <- gsub(pattern='<[^<>]*>', replacement=" ", x=dat1$Comments)
dat1$Comments <- gsub(pattern='\n', replacement=" ", x=dat1$Comments)
dat1$Comments <- gsub(pattern="[[:punct:]]", replacement=" ", x=dat1$Comments)

Remove stop words (Where my problem is)

sw <- paste0("\\b(", paste0(stop_words$word, collapse="|"), ")\\b")
dat1$Comments <- lapply(dat1$Comments, function(x) (gsub(pattern=sw, replacement=" ", x)))

Remove extra spaces between words

dat1$Comments <- trimws(gsub("\\s+", " ", dat1$Comments))
dat1$Comments <- gsub("(^[[:space:]]*)|([[:space:]]*$)", "", dat1$Comments)

Sweet Data

top_phrases <- data.frame(text = dat1$Comments) %>%
unnest_tokens(bigram, text, 'ngrams', n = Length, to_lower = TRUE) %>% 
count(bigram, sort = TRUE)


This is the error that pops up; it traces back to the gsub call:

 Error in gsub(pattern = sw, replacement = " ", x) : assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634 

If anyone is curious, here is what is stored in "sw"


CodePudding user response:

Both TRE (the default regex engine used by base R regex functions) and PCRE (the engine used by base R regex functions with perl=TRUE) have fairly strict limits on pattern length, and a pattern built from an entire stop-word list can exceed them.

In your case, the stringr regex functions will work better because they use the ICU regex engine, which supports much longer patterns.

So, you may replace

gsub(pattern=sw, replacement=" ", x)

with

stringr::str_replace_all(x, sw, " ")
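
For illustration, here is a minimal sketch of that replacement applied to a whole column at once. The vectors stop_vec and comments are hypothetical stand-ins for stop_words$word and dat1$Comments, and the ignore_case option and the str_squish() cleanup are assumptions added for the example, not part of the original code. Since str_replace_all() is vectorized over its input, the lapply() wrapper should not be needed.

```r
library(stringr)

# Hypothetical stand-ins for stop_words$word and dat1$Comments
stop_vec <- c("did", "not", "because", "of", "and")
comments <- c("Did not qualify because of X and Y",
              "Did not qualify because of Y and X")

# Same pattern construction as in the question
sw <- paste0("\\b(", paste0(stop_vec, collapse = "|"), ")\\b")

# str_replace_all() is vectorized, so the whole column is handled in one call.
# ignore_case = TRUE is an assumption so that "Did" matches "did".
cleaned <- str_replace_all(comments, regex(sw, ignore_case = TRUE), " ")

# Collapse repeated whitespace and trim both ends
cleaned <- str_squish(cleaned)

print(cleaned)
# [1] "qualify X Y" "qualify Y X"
```

With the real data this would be something like dat1$Comments <- stringr::str_replace_all(dat1$Comments, sw, " "), which also keeps the column a plain character vector rather than a list.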