Hello everyone, I hope you're having a great day!
I have been using the AWESOME quanteda library for text analysis lately and it's been quite a joy. Recently I stumbled on a task: I need to use a dictionary relating words to a numeric sentiment score to compute a per-document measure called NetSentScore, which is calculated in the following manner:
NetSentScore per document = sum(Positive_wordscore) + sum(Negative_wordscore)
I have the following dictionary:
ScoreDict <- tibble::tibble(
  score = c(-5, -9, 1, 8, 9, -10),
  word  = c("bad", "horrible", "open", "awesome", "gorgeous", "trash")
)
My corpus:
text <- c("this is a bad movie very bad", "horrible movie, just awful",
          "im open to new dreams", "awesome place i loved it",
          "she is gorgeous", "that is trash")
By definition quanteda will not allow numeric data in a dictionary, but I can have this:
> text %>%
    corpus() %>%
    tokens(remove_punct = TRUE) %>%
    tokens_remove(stopwords("en")) %>%
    dfm()
Document-feature matrix of: 6 documents, 14 features (82.14% sparse) and 0 docvars.
features
docs bad movie horrible just awful im open new dreams awesome
text1 2 1 0 0 0 0 0 0 0 0
text2 0 1 1 1 1 0 0 0 0 0
text3 0 0 0 0 0 1 1 1 1 0
text4 0 0 0 0 0 0 0 0 0 1
text5 0 0 0 0 0 0 0 0 0 0
text6 0 0 0 0 0 0 0 0 0 0
[ reached max_nfeat ... 4 more features ]
which gives me the number of times each word was found in each document. I would only need to "join" or "merge" this with my dictionary so that I have the score for each word, and then compute the NetSentScore. Is there a way to do this in quanteda? Please keep in mind that I have quite a massive corpus (over 500k documents and approx. 800 features), so converting my dfm to a data frame would exhaust the RAM.
To illustrate, the NetSentScore of text1 will be:
2*(-5) + 0 = -10, because the word "bad" appears two times and, according to the dictionary, it has a score of -5.
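On a toy scale, the join-and-sum I have in mind would look something like this (using dplyr just to illustrate what I'm after; this is exactly the data-frame route I want to avoid on the full corpus):

```r
library(dplyr)
library(tibble)

ScoreDict <- tibble(
  score = c(-5, -9, 1, 8, 9, -10),
  word  = c("bad", "horrible", "open", "awesome", "gorgeous", "trash")
)

# toy word counts for text1 only
counts <- tibble(word = c("bad", "movie"), n = c(2, 1))

counts %>%
  inner_join(ScoreDict, by = "word") %>%   # keep only scored words
  summarise(NetSentScore = sum(n * score)) # 2 * -5 = -10
```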
Thank you so much, love from Italy!
CodePudding user response:
As @stomper suggests, you can do this with the quanteda.sentiment package, by setting the numeric values as "valences" for the dictionary. Here's how to do it.
This ought to work on 500k documents but of course this will depend on your machine's capacity.
library("quanteda")
#> Package version: 3.2.4
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
library("quanteda.sentiment")
#>
#> Attaching package: 'quanteda.sentiment'
#> The following object is masked from 'package:quanteda':
#>
#> data_dictionary_LSD2015
dict <- dictionary(list(
  sentiment = c("bad", "horrible", "open", "awesome", "gorgeous", "trash")
))
valence(dict) <- list(
  sentiment = c(bad = -5,
                horrible = -9,
                open = 1,
                awesome = 8,
                gorgeous = 9,
                trash = -10)
)
print(dict)
#> Dictionary object with 1 key entry.
#> Valences set for keys: sentiment
#> - [sentiment]:
#> - bad, horrible, open, awesome, gorgeous, trash
text <- c("this is a bad movie very bad",
          "horrible movie, just awful",
          "im open to new dreams",
          "awesome place i loved it",
          "she is gorgeous",
          "that is trash")
Now, to compute the document scores, you use textstat_valence(), but you set the normalization to "none" in order to sum the valences rather than average them. Normalisation is the default because raw sums are affected by documents having different lengths, but as this package is still in a developmental stage, it's easy to imagine that other choices might be preferable to the default.
textstat_valence(tokens(text), dictionary = dict, normalization = "none")
#> doc_id sentiment
#> 1 text1 -10
#> 2 text2 -9
#> 3 text3 1
#> 4 text4 8
#> 5 text5 9
#> 6 text6 -10
Created on 2023-01-11 with reprex v2.0.2
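Given the concern about 500k documents, another rough sketch (my own suggestion, not part of quanteda.sentiment's recommended workflow) is to stay entirely in sparse-matrix land: align the dfm's features with the score vector via dfm_match() and take a sparse matrix-vector product, so the dfm is never converted to a data frame.

```r
library(quanteda)

text <- c("this is a bad movie very bad", "horrible movie, just awful",
          "im open to new dreams", "awesome place i loved it",
          "she is gorgeous", "that is trash")

scores <- c(bad = -5, horrible = -9, open = 1,
            awesome = 8, gorgeous = 9, trash = -10)

# keep only (and exactly) the scored features, in score-vector order
d <- tokens(text, remove_punct = TRUE) |>
  dfm() |>
  dfm_match(features = names(scores))

# sparse matrix-vector product: counts weighted by score, summed per doc
NetSentScore <- as.vector(d %*% scores)
NetSentScore
#> [1] -10  -9   1   8   9 -10
```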