I want to look up words from a dictionary using R. I used grepl to search for each word x in the Dutch dictionary dutch, and wrote it so that it returns the proportion of words in a sentence that come from this dictionary. My script is as follows:
tw_copy$token <- tokenize_tweets(tw_copy$text)

count <- function(x) {
  de <- c()
  other <- c()
  for (token in x) {
    if (grepl(token, dutch) == TRUE) {
      de <- c(de, token)
    } else {
      other <- c(other, token)
    }
  }
  return(c(length(de) / length(x), length(other) / length(x)))
}

result <- lapply(tw_copy$token, FUN = count)
tw_copy$de <- lapply(result, "[[", 1)
The output is correct, but it is really slow and cannot produce a result for a bigger dataset.
Can anyone suggest another way to write this for faster performance?
CodePudding user response:
I would do something like this:
base R
tw_copy$num_dutch_words = lapply(tw_copy$token, \(x) sum(x %in% dutch))
Input
dutch = c("this", "tweet", "one")
tw_copy = data.frame(
  author = c("a", "b", "c"),
  text = c("this is the first tweet",
           "this one is the second tweet",
           "and this one is the third tweet")
)
tw_copy$token = lapply(tw_copy$text, \(x) strsplit(x, " ")[[1]])
data.table
(this was my original answer, as I assumed you might get a speed-up with a very large dataset, but that assumption may be wrong)
tw_copy[, num_dutch_words := sum(token[[1]] %in% dutch), by = 1:nrow(tw_copy)]
Input Data:
dutch = c("this", "tweet", "one")
library(data.table)
tw_copy = data.table(
  author = c("a", "b", "c"),
  text = c("this is the first tweet",
           "this one is the second tweet",
           "and this one is the third tweet")
)
tw_copy[, token := list(strsplit(text, " "))]
In both cases, output like this:
author text token num_dutch_words
1: a this is the first tweet this,is,the,first,tweet 2
2: b this one is the second tweet this,one,is,the,second,tweet 3
3: c and this one is the third tweet and,this,one,is,the,third,... 3
CodePudding user response:
First, some observations about the approach:
- grepl and sum are vectorized (a short sketch of the vectorized idea follows this list),
- the loop is growing a vector (bad practice),
- every word is separated by a space, i.e. the delimiter is fixed.
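A minimal sketch of the vectorized idea; the tokens and dict vectors here are made up for illustration (in the real code, the scraped dutch dictionary below plays the role of dict):
tokens <- c("wil", "jij", "honderden", "kinderen", "de")
dict   <- c("wil", "de", "kunnen")
sum(tokens %in% dict) / length(tokens)  # share of dictionary words, no loop or grepl needed
#> [1] 0.4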
Making a sample dictionary, roughly 11k Dutch words:
library(rvest)
library(stringi)
l <- list(
  c("Natuurlijk we kunnen niet anders"),
  c("wil jij honderden kinderen de"),
  c("van alle geestelijke leiders is")
)

dutch <- read_html("https://cooljugator.com/nl/list/all") %>%
  html_elements("a") %>%
  html_attr("href") %>%
  stri_extract_all_words(simplify = TRUE) %>%
  .[, 2] %>%
  stri_remove_empty() %>%
  .[7:length(.)]
lapply speed diminishes as the list grows; use vapply or a loop that writes to a correctly initialized vector instead. Further, base R %in% can be optimized, as is done in the fastmatch package.
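As a rough sketch, the "correctly initialized vector" alternative could look like this (assuming the l and dutch objects defined above):
res <- integer(length(l))  # pre-allocate the result instead of growing it with c()
for (i in seq_along(l)) {
  res[i] <- sum(strsplit(l[[i]], " ", fixed = TRUE)[[1]] %in% dutch)
}
The vapply() version below follows the same pattern, with fastmatch::fmatch() replacing %in%: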
library(fastmatch)
f <- function(data, dictionary) {
  vapply(data, \(x) {
    # split the sentence once, then count tokens that have a dictionary match
    sum(fmatch(
      strsplit(x, " ", fixed = TRUE)[[1]],
      dictionary, nomatch = 0L) > 0L
    )
  }, 1L)
}
f(l, dutch)
#> [1] 1 2 0
With that optimization in place, a quick benchmark against the data.table approach:
library(data.table)
tw_copy = data.table(
  author = c("a", "b", "c"),
  text = c("Natuurlijk we kunnen niet anders",
           "wil jij honderden kinderen de",
           "van alle geestelijke leiders is")
)

bench::mark(
  x = f(l, dutch),
  y = {
    tw_copy[, token := list(strsplit(text, " "))]
    tw_copy[, num_dutch_words := sum(token[[1]] %in% dutch), by = 1:nrow(tw_copy)]
  }, check = FALSE
)[c(1, 3, 5, 7, 9)]
expression median mem_alloc n_itr total_time
<bch:expr> <dbl> <bch:byt> <int> <dbl>
1 x 0.0141 0B 9999 168.
2 y 3.65 706KB 127 459.
Note that the outputs are not identical, so the benchmark is only an indication; the actual result will depend on your input and desired output. Other optimizations could be made to the data structure, for example removing stopwords before looping through a dictionary that contains verbs.
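A small sketch of that stopword idea; the stopword vector here is illustrative, not a complete list:
stopwords_nl <- c("de", "het", "een", "en", "van", "is", "niet", "we")
tokens <- strsplit("Natuurlijk we kunnen niet anders", " ", fixed = TRUE)[[1]]
tokens[!tokens %in% stopwords_nl]  # fewer tokens left to look up in the dictionary
#> [1] "Natuurlijk" "kunnen"     "anders"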
For optimization, it is useful to go through a checklist:
- Is there a vectorized approach?
- Is my data structure suitable for the task?
- Is my loop properly initializing vectors?
If the speed is still too slow, other approaches can be considered:
- Splitting up the dictionary
- Joins?
- Use hashed environments? (a small sketch follows below)
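A minimal sketch of the hashed-environment idea, with a tiny made-up dictionary (the scraped dutch vector from above could be used instead):
dict_env <- new.env(hash = TRUE)
for (w in c("wil", "de", "kunnen")) assign(w, TRUE, envir = dict_env)

tokens <- c("wil", "jij", "honderden", "kinderen", "de")
sum(vapply(tokens, exists, logical(1), envir = dict_env, inherits = FALSE))
#> [1] 2
Whether this beats fmatch() depends on the dictionary and input size, so it is worth benchmarking on your own data.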