String match error "invalid regular expression, reason 'Out of memory'"

Time:11-01

I have a table called df that is shaped like this (the actual table has 16,263 rows):

title             date            brand
big farm house    2022-01-01      A
ranch modern      2022-01-01      A
town house        2022-01-01      C

Then I have a table called match_list that looks like this (the actual list has 94,000 rows):

words_for_match
farm
town
clown
beach
city
pink

I'm trying to filter the first table down to the rows whose title contains a word from the words_for_match list. So I do this:

match_list <- match_list$words_for_match

match_list <- paste(match_list, collapse = "|")

match_list <- sprintf("\\b(%s)\\b", match_list)

df %>% 
  filter(grepl(match_list, title))

But then I get the following error:

Problem while computing `..1 = grepl(match_list, subject)`.
Caused by error in `grepl()`:
! invalid regular expression, reason 'Out of memory'

If I filter the table with 94,000 rows to just 1,000 then it runs, so it appears to just be a memory issue. So I'm wondering if there's a less memory-intensive way to do this or if this is an example of needing to look beyond my computer for computation. Advice on either pathway (or other options) is welcome. Thanks!

CodePudding user response:

You could filter the titles sequentially, one word at a time. Once a title has matched a word (say 10 titles match 'farm'), you no longer need to test it against the remaining words. Here is a simple implementation:

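# Toy data; in practice use df$title and match_list$words_for_match
titles <- c("big farm house", "ranch modern", "town house")
words_for_match <- c("farm", "town", "clown", "beach", "city", "pink")
titles.to.keep <- character(0)
for (w in words_for_match) {
    # Build a small pattern for a single word, matched on word boundaries
    pattern <- sprintf("\\b(%s)\\b", w)
    is.match <- grepl(pattern, titles)
    # Keep the matched titles and drop them from the pool still to be tested
    titles.to.keep <- c(titles.to.keep, titles[is.match])
    titles <- titles[!is.match]
    print(paste(length(titles), "remaining titles"))
    # Stop early once every title has matched
    if (length(titles) == 0) break
}
titles.to.keep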
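
A variation on the loop above, sketched under the assumption that a pattern of about 1,000 words still compiles (as the question reports): process the words in batches rather than one at a time, and combine the results. The name chunk_size is hypothetical and would need tuning on the real data.

```r
# Toy data from the question
titles <- c("big farm house", "ranch modern", "town house")
words_for_match <- c("farm", "town", "clown", "beach", "city", "pink")

# Hypothetical batch size; the question reports that 1,000 words work
chunk_size <- 1000

# Split the words into consecutive batches of at most chunk_size words
chunks <- split(words_for_match,
                ceiling(seq_along(words_for_match) / chunk_size))

# Start with "no match" for every title and fill in matches batch by batch
keep <- rep(FALSE, length(titles))
for (chunk in chunks) {
    pattern <- sprintf("\\b(%s)\\b", paste(chunk, collapse = "|"))
    # Only test the titles that have not matched an earlier batch
    keep[!keep] <- grepl(pattern, titles[!keep])
}
titles[keep]
```

Unlike the one-word-at-a-time loop, this keeps the titles in their original order, and the logical vector keep can be used to subset the full table, for example df[keep, ] when it is computed from df$title.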

If you have prior knowledge of how frequent the words in match_list are, it's better to start with the most frequent ones, so that matching titles are removed from the pool as early as possible.
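
If every entry in match_list is a single plain word (an assumption; multi-word entries or entries containing regex syntax would not work here), another option is to avoid the regular expression altogether: split each title into words and test set membership with %in%. This is a minimal sketch in base R, and splitting on "\\W+" is only an approximation of the \\b word boundaries used in the question.

```r
# Toy data from the question
titles <- c("big farm house", "ranch modern", "town house")
words_for_match <- c("farm", "town", "clown", "beach", "city", "pink")

# Split each title into tokens on runs of non-word characters
tokens <- strsplit(titles, "\\W+")

# A title is kept if at least one of its tokens is in the word list
keep <- vapply(tokens,
               function(tk) any(tk %in% words_for_match),
               logical(1))
titles[keep]
```

Because %in% is a lookup rather than a pattern match, the size of the word list does not affect regex compilation at all.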
