Fastest way to search for a word in a dictionary in R


I want to search for words from a dictionary using R. I used grepl to check whether a word x appears in the dictionary dutch. In the end I wrote it so that it returns how many of the words in a sentence come from this dictionary. My script is as follows:

tw_copy$token <- tokenize_tweets(tw_copy$text)

count = function(x){
  de = c()
  other = c()
  for (token in x){
    if (grepl(token, dutch)==TRUE){
      de <- c(de, token)
    }
    else{
      other <- c(other, token)
    }
  }
  return (c(length(de)/length(x), length(other)/length(x)))
}

result <- lapply(tw_copy$token, FUN = count)

tw_copy$de =  lapply(result, "[[", 1)

Now, the output is right, but it is really slow and cannot produce output for a bigger dataset.

Can anyone suggest another way to write it for faster performance?

CodePudding user response:

I would do something like this:

base R

tw_copy$num_dutch_words = lapply(tw_copy$token, \(x) sum(x %in% dutch))
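
If a plain integer column is preferred over a list column, vapply (or sapply) can be used the same way. A minimal variant, assuming the same tw_copy and dutch objects as under Input:

tw_copy$num_dutch_words <- vapply(tw_copy$token, \(x) sum(x %in% dutch), integer(1))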

Input

dutch = c("this", "tweet", "one")
tw_copy = data.frame(
  author=c("a","b","c"),
  text = c("this is the first tweet",
           "this one is the second tweet",
           "and this one is the third tweet")
)
tw_copy$token = lapply(tw_copy$text, \(x) strsplit(x, " ")[[1]])

data.table

(this was my original answer, as I assumed you might get some speed-up with a very large dataset, but my assumption may be wrong)

tw_copy[, num_dutch_words:=sum(token[[1]] %in% dutch), by=1:nrow(tw_copy)]

Input Data:

library(data.table)
dutch = c("this", "tweet", "one")

tw_copy = data.table(
  author=c("a","b","c"),
  text = c("this is the first tweet",
           "this one is the second tweet",
           "and this one is the third tweet")
)
tw_copy[, token:=list(strsplit(text," "))]

In both cases, the output looks like this:

   author                            text                         token num_dutch_words
1:      a         this is the first tweet       this,is,the,first,tweet               2
2:      b    this one is the second tweet  this,one,is,the,second,tweet               3
3:      c and this one is the third tweet and,this,one,is,the,third,...               3

CodePudding user response:

First, some observations about the approach (a vectorized sketch follows the list):

  • grepl and sum are vectorized,
  • the loop is growing a vector (bad practice),
  • every word is separated by a space, i.e. the delimiter is fixed.
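
Taken together, these observations suggest replacing the per-token grepl loop with a single vectorized membership test. A minimal sketch, assuming the tw_copy$token list and dutch vector from the question:

count2 <- function(x){
  hits <- x %in% dutch           # one vectorized membership test instead of grepl per token
  c(mean(hits), mean(!hits))     # the same two proportions the original count() returned
}
result <- lapply(tw_copy$token, count2)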

Making a sample dictionary, roughly 11k Dutch words:

library(rvest)
library(stringi)
l <- list(
  c("Natuurlijk we kunnen niet anders"),
  c("wil jij honderden kinderen de"),
  c("van alle geestelijke leiders is")
)

dutch <- read_html("https://cooljugator.com/nl/list/all") %>%
  html_elements("a") %>%
  html_attr("href") %>%
  stri_extract_all_words(simplify = TRUE) %>%
  .[,2] %>%
  stri_remove_empty() %>%
  .[7:length(.)]

lapply's speed diminishes as the list grows; use vapply or a loop that writes into a correctly initialized vector instead. Further, base R's %in% can be optimized, as is done in the fastmatch package.

library(fastmatch)
f <- function(data, dictionary) {
  vapply(data, \(x){
    sum(fmatch(
        strsplit(x, " ", fixed = TRUE)[[1]],
        dictionary, nomatch = 0L) > 0L
    )
  }, 1L)
}
f(l, dutch)
#> [1] 1 2 0
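
To apply this to the tweet data, the text column can be passed directly (assuming a tw_copy with a text column, as defined in the benchmark below):

tw_copy$num_dutch_words <- f(tw_copy$text, dutch)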

After the optimization, a customary benchmark:

library(data.table)
tw_copy = data.table(
  author=c("a","b","c"),
  text = c("Natuurlijk we kunnen niet anders",
           "wil jij honderden kinderen de",
           "van alle geestelijke leiders is")
)

bench::mark(
  x = f(l, dutch),
  y = {
    tw_copy[, token:=list(strsplit(text," "))]
    tw_copy[, num_dutch_words:=sum(token[[1]] %in% dutch), by=1:nrow(tw_copy)]
  }, check = F
)[c(1,3,5,7,9)]

  expression median mem_alloc n_itr total_time
  <bch:expr>  <dbl> <bch:byt> <int>      <dbl>
1 x          0.0141        0B  9999       168.
2 y          3.65       706KB   127       459.

Note that the outputs are not the same, so the benchmark is just an indication; the actual result will depend on your input and desired output. Other optimizations could be made to the data structure, for example removing stopwords before looking tokens up in a dictionary that contains verbs.
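
For example, with the stopwords package (assumed here for illustration; any Dutch stopword list would work), stopwords could be dropped from the tokens before the dictionary lookup:

library(stopwords)
nl_stop <- stopwords("nl")                           # Snowball Dutch stopword list
tokens <- strsplit(tw_copy$text, " ", fixed = TRUE)
tokens <- lapply(tokens, \(x) x[!x %in% nl_stop])    # drop stopwords before the dictionary lookup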

For optimization, it is useful to go through a checklist:

  • Is there a vectorized approach?
  • Is my data structure suitable for the task?
  • Is my loop properly initializing vectors?

If the speed is still slow, other approaches can be considered.
