R quanteda: creating and computing percentage of co-occurrence based on keywords

Hello, I have the following data set:

df <- data.frame (text  = c("House Sky Blue",
                            "House Sky Green",
                            "House Sky Red",
                            "House Sky Yellow",
                            "House Sky Green",
                            "House Sky Glue",
                            "House Sky Green"))

I'd like to find the percentage of co-occurrence of some terms or tokens. For example, out of all documents that contain the token "House", I want to know how many also include the term "Green".

In our data, 7 documents contain the term House, and 3 of those 7 (p = 100*3/7) also include the term Green. It would also be nice to see which terms or tokens appear alongside the token "House" above some threshold p.

I have tried these two functions:

textstat_collocations(tokens)

> textstat_collocations(tokens)
  collocation count count_nested length   lambda        z
1   house sky     7            0      2 5.416100 2.622058
2   sky green     3            0      2 2.456736 1.511653

The function textstat_simil:

textstat_simil(dfm(tokens),margin="features")

textstat_simil object; method = "correlation"
       house sky   blue  green    red yellow   glue
house    NaN NaN    NaN    NaN    NaN    NaN    NaN
sky      NaN NaN    NaN    NaN    NaN    NaN    NaN
blue     NaN NaN  1.000 -0.354 -0.167 -0.167 -0.167
green    NaN NaN -0.354  1.000 -0.354 -0.354 -0.354
red      NaN NaN -0.167 -0.354  1.000 -0.167 -0.167
yellow   NaN NaN -0.167 -0.354 -0.167  1.000 -0.167
glue     NaN NaN -0.167 -0.354 -0.167 -0.167  1.000

but they do not seem to give my desired output. I also wonder why the correlation between green and house is NaN in the textstat_simil output.

My desired output would show the following info:

feature="House"
 percentage of co-occurrence 

Green = 3/7
Blue= 1/7
Red = 1/7
Yellow = 1/7
Glue = 1/7

I am an attentive user and will gladly upvote and accept the best answer. Thank you so much for your help: in the quanteda docs I can't seem to find a function that gives my desired output, although I suspect there must be a way, since I find this library so fast and complete. I am looking for a solution that uses only the quanteda library. Thanks again!

CodePudding user response:

I couldn't find anything inside quanteda, so I cobbled something together: one function to create a list object with the chosen word and a frequency table, and one print function to print the output the way you want. You can adjust the functions to return just what you need and add more tests to check the inputs.

Code part:

dat <- data.frame (text  = c("House Sky Blue",
                            "House Sky Green",
                            "House Sky Red",
                            "House Sky Yellow",
                            "House Sky Green",
                            "House Sky Glue",
                            "House Sky Green"))


library(quanteda)
library(quanteda.textstats)

my_dfm <- dfm(tokens(corpus(dat)))
freqs <- textstat_frequency(my_dfm)

# create function to return a list with the chosen word and a frequency table    
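create_co_occurrence <- function(x, word) {

  if (!inherits(x, "frequency")) {
    stop("x must be a frequency table generated by textstat_frequency.",
         call. = FALSE)
  }

  # word must be a single character string that occurs in the table
  if (!is.character(word) || length(word) != 1L) {
    stop("word must be a single character string.", call. = FALSE)
  }
  if (!word %in% x$feature) {
    stop("word was not found among the features of x.", call. = FALSE)
  }

  # total frequency of the chosen word
  word_frequency <- x$frequency[x$feature == word]

  # frequency of every other feature relative to the chosen word
  out <- x[x$feature != word, ]
  out$percentage <- out$frequency / word_frequency
  out <- out[, c("feature", "percentage")]
  # reset row.names
  row.names(out) <- NULL

  out_list <- list(word = word,
                   co_occurrence = out)

  class(out_list) <- c("co_occurrence", "list")
  out_list
}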

# create print function.
print.co_occurrence <- function(x, ...) {
  
  writeLines(sprintf("feature = %s"  , x$word))
  writeLines("percentage of co-occurrence
             ")
  print.data.frame(x$co_occurrence)
}

output:

test <- create_co_occurrence(freqs, "house")

# calling test will activate the print.co_occurrence function and format the results
test

feature = house
percentage of co-occurrence
             
  feature percentage
2     sky  1.0000000
3   green  0.4285714
4    blue  0.1428571
5     red  0.1428571
6  yellow  0.1428571
7    glue  0.1428571
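
Note that the function above divides overall term frequencies, so it matches the requested percentages only because every document in this example contains "house". If that is not the case in your data, a document-level variant may be closer to what you want. Below is a minimal sketch, assuming the same toy data and using only quanteda's dfm_subset(), docfreq() and ndoc(). The names target, has_target, sub_dfm and share are made up for illustration, and the 0.4 threshold is arbitrary.

```r
library(quanteda)

dat <- data.frame(text = c("House Sky Blue",
                           "House Sky Green",
                           "House Sky Red",
                           "House Sky Yellow",
                           "House Sky Green",
                           "House Sky Glue",
                           "House Sky Green"))

my_dfm <- dfm(tokens(corpus(dat)))

# hypothetical name: the feature whose co-occurrences we want
target <- "house"

# TRUE for every document that contains the target at least once
has_target <- as.matrix(my_dfm[, target])[, 1] > 0

# keep only the documents that contain the target
sub_dfm <- dfm_subset(my_dfm, has_target)

# docfreq() counts the documents in which each feature appears;
# dividing by the number of target documents gives the co-occurrence share
share <- docfreq(sub_dfm) / ndoc(sub_dfm)

# drop the target itself and sort from most to least frequent
share <- sort(share[names(share) != target], decreasing = TRUE)

# features that co-occur with the target in at least 40% of its documents
# (the threshold is an arbitrary choice for this example)
share[share >= 0.4]
```

Here share holds, for each remaining feature, the fraction of "house" documents that also contain it (3/7 for green in this data), and the last line keeps only the features at or above the chosen threshold.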