Hello guys, I have a fairly massive data set that contains reviews of utility services from customers all over the UK. This is a small sample of what the data looks like:
df <- data.frame (text = c("The investors and their supporters shall invest and do something mostly invest",
" Shall we tell the investors to invest ?", "Investors shall invest.",
"Investors may sometimes invest","spend what Investor Do"),
date = c("10/12/2022", "10/12/2022", "10/12/2022","11/12/2022","12/12/2022"))
- What I want is to be able to count the frequency of terms/words/tokens by date. For instance, the word invest appears 6 times in total, and for the date 10/12/2022 its word count is 4. I want to use the quanteda library (since it is so powerful) to get the counts and plot them over date.
- I also want to plot the association or co-occurrence of the words investor and invest over date. For instance, in this example we have 5 reviews, and in 4 of those 5 the words invest and investor are both present. I'd like to plot that percentage over date as well, if that is possible. Or what options does the quanteda library have that can perform this task? Would it also be possible to find the most frequent words that appear when "invest" appears, with, say, a minimum proportion of 0.25?
To achieve the first point I started with the following code:
df %>%
corpus(text_field="text") %>%
dfm() %>%
textstat_frequency(10)
which gives:
feature frequency rank docfreq group
1 invest 6 1 5 all
2 investors 4 2 4 all
3 shall 3 3 3 all
4 the 2 4 2 all
5 and 2 4 1 all
6 do 2 4 2 all
7 their 1 7 1 all
8 supporters 1 7 1 all
9 something 1 7 1 all
10 we 1 7 1 all
Warning message:
'dfm.corpus()' is deprecated. Use 'tokens()' first.
But how would I go about plotting the frequency of these words over the date column? I read in the documentation that one can group, but I have had no luck in doing so.
For the second question, I don't know for sure which function of the quanteda library to use, but I am trying to mirror the tm::findAssocs() function from the tm library.
I am paying close attention to your answers, guys. I will be upvoting and picking an answer as soon as they come. THANKS A TRILLION for your help, it really means the world to me.
CodePudding user response:
Answer to your first question:
The dates are put into the docvars part of your corpus. They can be used within textstat_frequency via the groups option.
dat <- data.frame (text = c("The investors and their supporters shall invest and do something mostly invest",
" Shall we tell the investors to invest ?", "Investors shall invest.",
"Investors may sometimes invest","spend what Investor Do"),
date = c("10/12/2022", "10/12/2022", "10/12/2022","11/12/2022","12/12/2022"))
library(dplyr)
library(quanteda)
library(quanteda.textstats)
dat %>%
corpus(text_field="text") %>%
tokens() %>%
dfm() %>%
textstat_frequency(groups = date)
feature frequency rank docfreq group
1 invest 4 1 3 10/12/2022
2 investors 3 2 3 10/12/2022
3 shall 3 2 3 10/12/2022
4 the 2 4 2 10/12/2022
5 and 2 4 1 10/12/2022
6 their 1 6 1 10/12/2022
7 supporters 1 6 1 10/12/2022
8 do 1 6 1 10/12/2022
9 something 1 6 1 10/12/2022
10 mostly 1 6 1 10/12/2022
11 we 1 6 1 10/12/2022
12 tell 1 6 1 10/12/2022
13 to 1 6 1 10/12/2022
14 ? 1 6 1 10/12/2022
15 . 1 6 1 10/12/2022
16 investors 1 1 1 11/12/2022
17 invest 1 1 1 11/12/2022
18 may 1 1 1 11/12/2022
19 sometimes 1 1 1 11/12/2022
20 do 1 1 1 12/12/2022
21 spend 1 1 1 12/12/2022
22 what 1 1 1 12/12/2022
23 investor 1 1 1 12/12/2022
You now have access to the frequency per day.
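To plot those per-day counts, one option is to filter the grouped frequency table down to the term of interest and hand it to ggplot2. This is a minimal sketch rather than part of the original answer: it assumes the dates are in dd/mm/yyyy format, that punctuation can be dropped, and that ggplot2 is installed. The object names (freq, invest_freq, p) are only illustrative.

```r
library(quanteda)
library(quanteda.textstats)
library(ggplot2)

dat <- data.frame(
  text = c("The investors and their supporters shall invest and do something mostly invest",
           " Shall we tell the investors to invest ?", "Investors shall invest.",
           "Investors may sometimes invest", "spend what Investor Do"),
  date = c("10/12/2022", "10/12/2022", "10/12/2022", "11/12/2022", "12/12/2022")
)

# Tokenise first (this avoids the dfm.corpus() deprecation warning),
# then count features per date
toks <- tokens(corpus(dat, text_field = "text"), remove_punct = TRUE)
freq <- as.data.frame(textstat_frequency(dfm(toks), groups = date))

# Keep a single term and parse the date strings (assumed to be dd/mm/yyyy)
invest_freq <- freq[freq$feature == "invest", ]
invest_freq$date <- as.Date(as.character(invest_freq$group), format = "%d/%m/%Y")

# One bar per date on which the term occurs
p <- ggplot(invest_freq, aes(x = date, y = frequency)) +
  geom_col() +
  labs(x = "Date", y = "Frequency of \"invest\"")
print(p)
```

Dates on which the term does not occur at all have no row in the frequency table, so they show up as gaps in the plot rather than as zero-height bars.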
As for question 2, I think you can use textstat_simil. Something like below. It does give somewhat different answers than tm::findAssocs, usually more features. So I'm not completely sure this is the correct answer. Maybe someone from the quanteda team can confirm or deny this.
my_dfm <- dat %>%
corpus(text_field="text") %>%
tokens() %>%
dfm()
textstat_simil(my_dfm,
my_dfm[, c("investor")],
method = "correlation",
margin = "features",
min_simil = 0.7)
textstat_simil object; method = "correlation"
investor
the .
investors .
and .
their .
supporters .
shall .
invest .
do .
something .
mostly .
we .
tell .
to .
? .
. .
may .
sometimes .
spend 1
what 1
investor 1
You can save the outcome of textstat_simil as a data.frame or list if you want to, with as.data.frame or as.list.
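For the share of reviews per date in which invest and investor occur together, a correlation measure may not be needed at all. A simpler route is to flag the documents that contain both terms and average the flags by date. The following is only a sketch under assumptions not stated in the thread: "investor" is matched with the glob pattern investor* so that "investors" counts too, punctuation is dropped, and the names has_invest, has_investor, both, share_by_date and prop are illustrative. The last step is one possible reading of the "minimum proportion of 0.25" part of the question.

```r
library(quanteda)

dat <- data.frame(
  text = c("The investors and their supporters shall invest and do something mostly invest",
           " Shall we tell the investors to invest ?", "Investors shall invest.",
           "Investors may sometimes invest", "spend what Investor Do"),
  date = c("10/12/2022", "10/12/2022", "10/12/2022", "11/12/2022", "12/12/2022")
)

dfmat <- dfm(tokens(corpus(dat, text_field = "text"), remove_punct = TRUE))

# Per-document flags: does the review contain the term at least once?
has_invest   <- ntoken(dfm_select(dfmat, pattern = "invest")) > 0
has_investor <- ntoken(dfm_select(dfmat, pattern = "investor*")) > 0
both <- has_invest & has_investor

# Share of reviews per date in which both terms occur
share_by_date <- tapply(both, docvars(dfmat, "date"), mean)
share_by_date

# Among reviews that contain "invest": proportion of those reviews
# containing each word, keeping words with a proportion of at least 0.25
with_invest <- dfm_subset(dfmat, has_invest)
prop <- docfreq(with_invest) / ndoc(with_invest)
sort(prop[prop >= 0.25], decreasing = TRUE)
```

The names of share_by_date are the original date strings, so they can be parsed with as.Date and passed to any plotting function to show the percentage over date.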