How to count occurrences of a word/token in a one-token-per-document-per-row tibble-CodePudding

Hello I have a tibble through a pipe from tidytext::unnest_tokens() and count(category, word, name = "count"). It looks like this example.

owl <- tibble(category = c(0, 1, 2, -1, 0, 1, 2),
              word = c(rep("hello", 3), rep("world", 4)),
              count = sample(1:100, 7))

and I would like to get this tibble with an additional column that gives the number of categories the word appears in, i.e. the same number for each time the word appears.

I tried the following code that works in principal. The result is what I want.

owl %>% mutate(sum_t = sapply(1:nrow(.), function(x) {filter(., word == .$word[[x]]) %>% nrow()}))

However, seeing that my data has 10s of thousands of rows this takes a rather long time. Is there a more efficient way to achieve this?

CodePudding user response：

We could use add_count:

library(dplyr)

 owl %>% 
   add_count(word)

output:

  category word  count     n
     <dbl> <chr> <int> <int>
1        0 hello    98     3
2        1 hello    30     3
3        2 hello    37     3
4       -1 world    22     4
5        0 world    80     4
6        1 world    18     4
7        2 world    19     4

CodePudding user response：

I played around with a few solutions and microbenchmark. I added TarJae's proposition to the benchmark. I also wanted to use the fantastic ave function just to see how it would compare to a dplyr solution.

library(microbenchmark)

n <- 500

owl2 <- tibble(
  category = sample(-10:10, n , replace = TRUE),
  word = sample(stringi::stri_rand_strings(5, 10), n, replace = TRUE),
  count = sample(1:100, n, replace = TRUE))

mb <- microbenchmark(
  op = owl2 %>% mutate(sum_t = sapply(1:nrow(.), function(x) {filter(., word == .$word[[x]]) %>% nrow()})),
  group_by = owl2 %>% group_by(word) %>% mutate(n = n()), 
  add_count = owl2 %>% add_count(word), 
  ave = cbind(owl2, n = ave(owl2$word, owl2$word, FUN = length)), 
  times = 50L)

autoplot(mb)   theme_bw()

The conclusion is that the elegant solution using add_count will save you a lot of time, and, ave speeds up a lot the process.