WHAT I WANT: I want to count the co-occurrence of two words, regardless of the order in which they appear in the string.
MY PROBLEM: I don't know how to handle the case where the two words appear in a different order.
SO FAR: I use the unnest_tokens function to split the string into words, with the "skip_ngrams" option for the token argument. Then I keep only the combinations of exactly two words, use separate to create the word1 and word2 columns, and finally count the occurrences.
The output that I get is like this:
# A tibble: 3 × 3
word1 word2 n
<chr> <chr> <dbl>
1 a c 3
2 b a 1
3 c a 5
But "a c" and "c a" differ only in word order, so they are counted as separate pairs. What I want is this:
# A tibble: 2 × 3
word1 word2 n
<chr> <chr> <dbl>
1 a c 8
2 b a 1
MY DATA: My data looks like this. Below is the whole process, with different data but the same problem. In this case the pair "a b" (together with "b a") and the pair "a c" (together with "c a") should each get n = 2.
library(tidyverse)
library(tidytext)
enframe(c("a b c a d e")) %>%
unnest_tokens(skipgram, value, token = "skip_ngrams", n = 5) %>%
mutate(n_words = str_count(skipgram, pattern = "\\S+")) %>%
filter(n_words == 2) %>%
separate(col = skipgram, into = c("word1", "word2"), sep = "\\s+") %>%
count(word1, word2)
#> # A tibble: 9 × 3
#> word1 word2 n
#> <chr> <chr> <int>
#> 1 a b 1
#> 2 a c 1
#> 3 a d 1
#> 4 a e 1
#> 5 b a 1
#> 6 b c 1
#> 7 c a 1
#> 8 c d 1
#> 9 d e 1
Created on 2022-02-09 by the reprex package (v2.0.1)
CodePudding user response:
We may use pmin/pmax to sort the two columns within each row before applying the count:
library(tidytext)
library(dplyr)
library(stringr)
library(tidyr)
library(tibble)  # provides enframe()
enframe(c("a b c a d e")) %>%
unnest_tokens(skipgram, value, token = "skip_ngrams", n = 5) %>%
mutate(n_words = str_count(skipgram, pattern = "\\S+")) %>%
filter(n_words == 2) %>%
separate(col = skipgram, into = c("word1", "word2"),
sep = "\\s+") %>%
transmute(word11 = pmin(word1, word2), word22 = pmax(word1, word2)) %>%
count(word11, word22)
Output:
# A tibble: 7 × 3
word11 word22 n
<chr> <chr> <int>
1 a b 2
2 a c 2
3 a d 1
4 a e 1
5 b c 1
6 c d 1
7 d e 1
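To see why this works, here is a minimal base R sketch applied to the three ordered pairs from the question's first table. The data frame `pairs` and the column names `first` and `second` are made up for illustration. `pmin()` and `pmax()` compare the two columns element by element, so each row ends up with its alphabetically smaller word first, and both orderings of a pair collapse into the same key. Note that string comparison follows the current locale's collation.

```r
# Ordered pair counts, copied from the question's first table
pairs <- data.frame(
  word1 = c("a", "b", "c"),
  word2 = c("c", "a", "a"),
  n     = c(3, 1, 5)
)

# Row-wise: smaller word goes to `first`, larger word to `second`
pairs$first  <- pmin(pairs$word1, pairs$word2)
pairs$second <- pmax(pairs$word1, pairs$word2)

# Sum the counts of rows that now share the same unordered pair
result <- aggregate(n ~ first + second, data = pairs, FUN = sum)
print(result)
```

The pair "a"/"c" ends up with n = 8 as requested. The pair written as "b a" in the question comes out as "a b", because the words are reordered inside every row.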
CodePudding user response:
The pmin/pmax approach by @akrun is super efficient. Here is another option with igraph:
library(tidytext)
library(dplyr)
library(stringr)
library(tidyr)
library(tibble)  # provides enframe()
library(igraph)
enframe(c("a b c a d e")) %>%
unnest_tokens(skipgram, value, token = "skip_ngrams", n = 5) %>%
mutate(n_words = str_count(skipgram, pattern = "\\S+")) %>%
filter(n_words == 2) %>%
separate(col = skipgram, into = c("word1", "word2"), sep = "\\s+") %>%
count(word1, word2) %>%
graph_from_data_frame(directed = FALSE) %>%
simplify(edge.attr.comb = "sum") %>%
get.data.frame()
which gives
from to n
1 a b 2
2 a c 2
3 a d 1
4 a e 1
5 b c 1
6 c d 1
7 d e 1
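The idea behind the igraph route can be sketched on the question's first table, assuming the igraph package is installed. The data frame `edges` is made up for illustration. In an undirected graph the edges a-c and c-a connect the same two vertices, so `simplify()` treats them as duplicates and merges them, and `edge.attr.comb = "sum"` tells it to add up their `n` attributes when it does.

```r
library(igraph)

# Ordered pair counts from the question's first table, as an edge list
edges <- data.frame(
  from = c("a", "b", "c"),
  to   = c("c", "a", "a"),
  n    = c(3, 1, 5)
)

# directed = FALSE makes a-c and c-a parallel edges between the same vertices
g <- graph_from_data_frame(edges, directed = FALSE)

# Merge parallel edges, summing their `n` attribute
g <- simplify(g, edge.attr.comb = "sum")

# Back to a data frame; the igraph:: prefix avoids a name clash with dplyr
merged <- igraph::as_data_frame(g, what = "edges")
print(merged)
```

Two edges remain, and the one between "a" and "c" carries n = 8. Which endpoint lands in the from column depends on igraph's internal vertex order, so it may not match the order in the input.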