Hello, I have the following data set:
df <- data.frame(text = c("House Sky Blue",
                          "House Sky Green",
                          "House Sky Red",
                          "House Sky Yellow",
                          "House Sky Green",
                          "House Sky Glue",
                          "House Sky Green"))
I'd like to find the percentage of co-occurrence between tokens. For example, out of all documents that contain the token "House", how many also contain the term "Green"?
In our data, 7 documents contain the term House, and 3 of those 7 (p = 100 * 3/7, about 43%) also contain the term Green. It would also be nice to see which terms or tokens appear alongside the token "House" within some threshold p.
I have tried these two functions. The first is textstat_collocations:
> textstat_collocations(tokens)
  collocation count count_nested length   lambda        z
1   house sky     7            0      2 5.416100 2.622058
2   sky green     3            0      2 2.456736 1.511653
The second is textstat_simil:
textstat_simil(dfm(tokens),margin="features")
textstat_simil object; method = "correlation"
       house sky   blue  green    red yellow   glue
house    NaN NaN    NaN    NaN    NaN    NaN    NaN
sky      NaN NaN    NaN    NaN    NaN    NaN    NaN
blue     NaN NaN  1.000 -0.354 -0.167 -0.167 -0.167
green    NaN NaN -0.354  1.000 -0.354 -0.354 -0.354
red      NaN NaN -0.167 -0.354  1.000 -0.167 -0.167
yellow   NaN NaN -0.167 -0.354 -0.167  1.000 -0.167
glue     NaN NaN -0.167 -0.354 -0.167 -0.167  1.000
Neither seems to give my desired output. I also wonder why textstat_simil reports the correlation between green and house as NaN.
My desired output would show the following info:
feature="House"
percentage of co-occurrence
Green = 3/7
Blue = 1/7
Red = 1/7
Yellow = 1/7
Glue = 1/7
I could not find a function in the quanteda docs that gives this output, although I suspect there is a way around it, since I find the library so fast and complete. I am looking for a solution that uses only the quanteda library. I will upvote and accept the best answer. Thank you very much for your help!
CodePudding user response:
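I couldn't find anything inside quanteda, so I cobbled something together: one function that creates a list object with the chosen word and a frequency table, and one print function that prints the output the way you want. You can adjust the functions to return just what you need and add more tests to check the inputs. Note that the percentage is each feature's overall frequency divided by the frequency of the chosen word. That matches the share of co-occurring documents here because "house" appears once in every document and no feature repeats within a document; it would not hold in general.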
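On the side question about the NaN entries: house and sky occur exactly once in every document, so their columns in the dfm are constant. Assuming the "correlation" method is Pearson's, it divides by each column's standard deviation, which is zero for a constant column, so the result is undefined. A minimal base-R sketch of the same effect, with the per-document counts typed in by hand from the data above (the variable names are only illustrative):

```r
# per-document counts for three features, in document order (typed in by hand)
house <- c(1, 1, 1, 1, 1, 1, 1)
green <- c(0, 1, 0, 0, 1, 0, 1)
blue  <- c(1, 0, 0, 0, 0, 0, 0)

sd(house)  # 0: house never varies across documents

# base R warns and returns NA for a zero-variance input;
# the textstat_simil output above shows the same undefined value as NaN
r_house_green <- suppressWarnings(cor(house, green))

# features that do vary give an ordinary correlation
r_green_blue <- cor(green, blue)
round(r_green_blue, 3)  # -0.354, matching the textstat_simil table
```

So the NaN values say nothing about how house and green relate; correlation simply cannot be computed for a feature that is present in every document.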
Code part:
dat <- data.frame(text = c("House Sky Blue",
                           "House Sky Green",
                           "House Sky Red",
                           "House Sky Yellow",
                           "House Sky Green",
                           "House Sky Glue",
                           "House Sky Green"))

library(quanteda)
library(quanteda.textstats)

my_dfm <- dfm(tokens(corpus(dat)))
freqs <- textstat_frequency(my_dfm)

# create function to return a list with the chosen word and a frequency table
create_co_occurrence <- function(x, word) {
  if (!inherits(x, "frequency")) {
    stop("x must be a frequency table generated by textstat_frequency.",
         call. = FALSE)
  }
  # word must be a single character string that occurs in the table
  if (!is.character(word) || length(word) != 1 || !word %in% x$feature) {
    stop("word must be a single feature present in x.", call. = FALSE)
  }

  word_frequency <- x$frequency[x$feature == word]
  # keep every other feature and express its frequency relative to the word
  out <- x[x$feature != word, ]
  out$percentage <- out$frequency / word_frequency
  out <- out[, c("feature", "percentage")]
  # reset row.names
  row.names(out) <- NULL

  out_list <- list(word = word,
                   co_occurrence = out)
  class(out_list) <- c("co_occurrence", "list")
  out_list
}

# create print function.
print.co_occurrence <- function(x, ...) {
  writeLines(sprintf("feature = %s", x$word))
  writeLines("percentage of co-occurrence\n")
  print.data.frame(x$co_occurrence)
  # return the object invisibly, as print methods conventionally do
  invisible(x)
}
output:
test <- create_co_occurrence(freqs, "house")
# calling test will activate the print.co_occurrence function and format the results
test
feature = house
percentage of co-occurrence
  feature percentage
2     sky  1.0000000
3   green  0.4285714
4    blue  0.1428571
5     red  0.1428571
6  yellow  0.1428571
7    glue  0.1428571
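The function above works from overall frequencies, which gives the right shares here only because "house" is in every document. If the chosen word appears in just some documents, one option is to count documents directly. The following is a rough sketch rather than a quanteda feature: co_occurrence_share is a hypothetical helper, and the document-feature matrix is built in base R (lowercased, split on whitespace) so the block runs on its own. With quanteda, as.matrix(dfm(tokens(dat$text))) should supply an equivalent matrix.

```r
# toy corpus from the question
texts <- c("House Sky Blue", "House Sky Green", "House Sky Red",
           "House Sky Yellow", "House Sky Green", "House Sky Glue",
           "House Sky Green")

# base-R stand-in for a document-feature matrix (assumption: lowercase,
# whitespace tokenisation); rows are documents, columns are features
toks  <- strsplit(tolower(texts), "\\s+")
feats <- unique(unlist(toks))
m <- t(vapply(toks, function(tk) as.numeric(feats %in% tk),
              numeric(length(feats))))
colnames(m) <- feats

# hypothetical helper: among documents containing `word`, the share that
# also contain each other feature
co_occurrence_share <- function(m, word) {
  stopifnot(is.matrix(m), word %in% colnames(m))
  has_word <- m[, word] > 0
  sub <- m[has_word, colnames(m) != word, drop = FALSE]
  sort(colSums(sub > 0) / sum(has_word), decreasing = TRUE)
}

shares <- co_occurrence_share(m, "house")
shares                 # green is 3/7, the single-occurrence colours 1/7

# keep only features at or above a chosen threshold p
shares[shares >= 0.4]  # sky and green
```

Because the denominator is the number of documents containing the word, calling co_occurrence_share(m, "green") looks only at the three Green documents, where house and sky each get a share of 1 and every other colour gets 0.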