I have a table called df that is shaped like this (the actual table has 16,263 rows):
title           date        brand
big farm house  2022-01-01  A
ranch modern    2022-01-01  A
town house      2022-01-01  C
Then I have a table called match_list that looks like this (the actual list has 94,000 rows):
words_for_match
farm
town
clown
beach
city
pink
I'm trying to filter the first table down to the rows whose title contains a word from the words_for_match list. So I do this:
match_list <- match_list$words_for_match
match_list <- paste(match_list, collapse = "|")
match_list <- sprintf("\\b(%s)\\b", match_list)
df %>%
  filter(grepl(match_list, title))
But then I get the following error:
Problem while computing `..1 = grepl(match_list, subject)`.
Caused by error in `grepl()`:
! invalid regular expression, reason 'Out of memory'
If I filter the table with 94,000 rows to just 1,000 then it runs, so it appears to just be a memory issue. So I'm wondering if there's a less memory-intensive way to do this or if this is an example of needing to look beyond my computer for computation. Advice on either pathway (or other options) is welcome. Thanks!
CodePudding user response:
You could filter the titles sequentially, one word at a time. Say 10 titles match 'farm': once they have matched, you do not need to test those titles against any other word. Here is a simple implementation:
titles <- c("big farm house", "ranch modern", "town house")
words_for_match <- c("farm", "town", "clown", "beach", "city", "pink")
titles.to.keep <- character(0)
for (w in words_for_match) {
  # Match w only as a whole word
  pattern <- sprintf("\\b(%s)\\b", w)
  is.match <- grepl(pattern, titles)
  # Keep the matching titles and drop them from the pool still to be tested
  titles.to.keep <- c(titles.to.keep, titles[is.match])
  titles <- titles[!is.match]
  print(paste(length(titles), "remaining titles"))
}
titles.to.keep
If you have prior knowledge of how often the words in match_list occur, it's better to start with the most frequent ones, since the pool of remaining titles then shrinks fastest.
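To get back to filtering df itself, here is a rough sketch of one way to adapt the idea; it is not part of the answer above. It tests the words in batches rather than one at a time, skips titles that have already matched, and returns a logical vector. The names match_any_word and chunk_size are made up for illustration. A batch size of 1,000 is only an assumption based on the question's remark that 1,000 words ran fine. The words are also assumed to contain no regex special characters.

```r
# Hypothetical helper: returns TRUE for each title that contains at least
# one of the words as a whole word. Words are tested in batches so that no
# single regular expression gets too large.
match_any_word <- function(titles, words, chunk_size = 1000) {
  matched <- rep(FALSE, length(titles))
  # Split the word list into batches of at most chunk_size words
  chunks <- split(words, ceiling(seq_along(words) / chunk_size))
  for (chunk in chunks) {
    todo <- which(!matched)          # titles that still need testing
    if (length(todo) == 0) break     # everything matched already
    pattern <- sprintf("\\b(%s)\\b", paste(chunk, collapse = "|"))
    matched[todo] <- grepl(pattern, titles[todo])
  }
  matched
}

# Toy data shaped like the tables in the question
df <- data.frame(
  title = c("big farm house", "ranch modern", "town house"),
  date  = as.Date("2022-01-01"),
  brand = c("A", "A", "C"),
  stringsAsFactors = FALSE
)
words_for_match <- c("farm", "town", "clown", "beach", "city", "pink")

# chunk_size = 2 only to exercise the batching on this tiny example
result <- df[match_any_word(df$title, words_for_match, chunk_size = 2), ]
result
```

With dplyr the same helper should drop into the original pipeline as df %>% filter(match_any_word(title, words_for_match)). Unlike the loop above, the rows come back in their original order, because the helper only builds a TRUE/FALSE mask.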