Count co-occurrences of two words in R when the order does not matter


WHAT I WANT: I want to count co-occurrences of two words, but I don't care about the order in which they appear in the string.

MY PROBLEM: I don't know how to handle the case where two given words appear in a different order.

SO FAR: I use the unnest_tokens function to split the string into words, using the "skip_ngrams" option for the token argument. Then I filter for combinations of exactly two words, use separate to create the word1 and word2 columns, and finally count the occurrences.

The output that I get is like this:

# A tibble: 3 × 3
  word1 word2      n
  <chr> <chr>  <dbl>
1 a     c          3
2 b     a          1
3 c     a          5

But words "a" and "c" occur in a different order so they are counted as a different element. What I want is this:

# A tibble: 2 × 3
  word1 word2      n
  <chr> <chr>  <dbl>
1 a     c          8
2 b     a          1

MY DATA: My data looks like this; below is the whole process with different data but the same problem. In this case, "a b" and "c a" should each take a value of n = 2.

library(tidyverse)
library(tidytext)
enframe(c("a b c a d e")) %>% 
  unnest_tokens(skipgram, value, token = "skip_ngrams", n = 5) %>% 
  mutate(n_words = str_count(skipgram, pattern = "\\S+")) %>%
  filter(n_words == 2) %>% 
  separate(col = skipgram, into = c("word1", "word2"), sep = "\\s+") %>% 
  count(word1, word2) 
#> # A tibble: 9 × 3
#>   word1 word2     n
#>   <chr> <chr> <int>
#> 1 a     b         1
#> 2 a     c         1
#> 3 a     d         1
#> 4 a     e         1
#> 5 b     a         1
#> 6 b     c         1
#> 7 c     a         1
#> 8 c     d         1
#> 9 d     e         1

Created on 2022-02-09 by the reprex package (v2.0.1)

CodePudding user response:

We may use pmin/pmax to sort the two word columns row-wise before applying the count:

library(tidytext)
library(dplyr)
library(stringr)
library(tidyr)
enframe(c("a b c a d e")) %>% 
  unnest_tokens(skipgram, value, token = "skip_ngrams", n = 5) %>% 
  mutate(n_words = str_count(skipgram, pattern = "\\S+")) %>%
  filter(n_words == 2) %>% 
  separate(col = skipgram, into = c("word1", "word2"), 
      sep = "\\s+") %>%
  # put the alphabetically smaller word first so that e.g. "a c" and "c a" collapse into one pair
  transmute(word11 = pmin(word1, word2), word22 = pmax(word1, word2)) %>%
  count(word11, word22)

-output

# A tibble: 7 × 3
  word11 word22     n
  <chr>  <chr>  <int>
1 a      b          2
2 a      c          2
3 a      d          1
4 a      e          1
5 b      c          1
6 c      d          1
7 d      e          1
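
For intuition: pmin() and pmax() also work element-wise on character vectors, comparing strings by the usual (locale-dependent) collation order, so in every row the alphabetically smaller word ends up in word11 and the larger one in word22. A tiny illustration with made-up vectors (not taken from the data above):

w1 <- c("a", "b", "c")
w2 <- c("c", "a", "a")

# parallel minimum/maximum compare the strings alphabetically, element by element
pmin(w1, w2)
#> [1] "a" "a" "a"
pmax(w1, w2)
#> [1] "c" "b" "c"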

CodePudding user response:

The pmin/pmax approach by @akrun is super efficient. Here is another option with igraph:

library(tidytext)
library(dplyr)
library(stringr)
library(tidyr)
library(igraph)

enframe(c("a b c a d e")) %>%
  unnest_tokens(skipgram, value, token = "skip_ngrams", n = 5) %>%
  mutate(n_words = str_count(skipgram, pattern = "\\S+")) %>%
  filter(n_words == 2) %>%
  separate(col = skipgram, into = c("word1", "word2"), sep = "\\s+") %>%
  count(word1, word2) %>%
  graph_from_data_frame(directed = FALSE) %>%  # undirected graph, so "a c" and "c a" become the same edge
  simplify(edge.attr.comb = "sum") %>%         # merge parallel edges, summing their n
  get.data.frame()

which gives

  from to n
1    a  b 2
2    a  c 2
3    a  d 1
4    a  e 1
5    b  c 1
6    c  d 1
7    d  e 1
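
Another way to make the pair order-independent, if you prefer to stay in dplyr, is to build a canonical key by sorting the two words within each row and counting on that key. A rough sketch along the lines of the pipelines above (the pair column name is just illustrative, and rowwise() makes this slower than the pmin/pmax version on large data):

library(tidytext)
library(dplyr)
library(stringr)
library(tidyr)

enframe(c("a b c a d e")) %>%
  unnest_tokens(skipgram, value, token = "skip_ngrams", n = 5) %>%
  mutate(n_words = str_count(skipgram, pattern = "\\S+")) %>%
  filter(n_words == 2) %>%
  separate(col = skipgram, into = c("word1", "word2"), sep = "\\s+") %>%
  # sort the two words within each row so that "a c" and "c a" share one key
  rowwise() %>%
  mutate(pair = paste(sort(c(word1, word2)), collapse = " ")) %>%
  ungroup() %>%
  count(pair) %>%
  separate(pair, into = c("word1", "word2"), sep = " ")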