I'm using this solution (get what percent of documents contain a feature - quanteda) to find the number of documents that contain any one of a group of features in my dataset. As long as a document contains any one of the words, I want it to return TRUE.
I got it to work, but only some of the time, and I can't figure out why. Removing or adding words sometimes fixes it and sometimes doesn't. This is the code I used (the compound phrases were already joined with tokens_compound before the dfm was built):
thetarget <- c("testing", "test", "example words", "example")
df <- data.frame(docname = docnames(dfm),
                 Year = docvars(dfm, c("Year")),
                 contains_target = rowSums(dfm[, thetarget]) > 0,
                 row.names = NULL)
And this is the error I sometimes get:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'rowSums':
Subscript out of bounds
TIA
Edit (script that creates a table showing, for each year, the number of documents containing any of the target words):
df2 <- df %>%
  mutate_if(is.logical, as.character) %>%
  filter(!str_detect(contains_target, "FALSE")) %>%
  group_by(Year) %>%
  summarise(n = n())
Answer:
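You are getting the error because, in some of the dfm objects you create, not all of the features in thetarget are present in the dfm. Selecting a column by a feature name that does not exist is what produces "Subscript out of bounds".
Here's a way to avoid that, using tokens_select() and docfreq():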
library("quanteda")
## Package version: 3.1.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
thetarget <- c("nuclear", "congress", "economy", "_not_a_feature_")
dfmat <- tokens(data_corpus_inaugural) %>%
  tokens_select(thetarget) %>%
  dfm()
docfreq(dfmat) / ndoc(dfmat)
##    economy   congress    nuclear 
## 0.52542373 0.49152542 0.08474576 
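The proportions above are per feature. Since the question asks about documents containing any one of the target words, a minimal sketch of that overall share could look like the following. It rebuilds the same dfmat as above (with nested calls instead of the pipe); the name any_share is only illustrative.

```r
library("quanteda")

thetarget <- c("nuclear", "congress", "economy", "_not_a_feature_")

# Same object as above: only target tokens are kept, so targets that
# never occur simply do not become features
dfmat <- dfm(tokens_select(tokens(data_corpus_inaugural), thetarget))

# A document "contains a target" if its row has at least one count
any_share <- mean(rowSums(dfmat) > 0)
any_share
```

This share can be no smaller than the largest single-feature proportion, because any document containing that feature also counts as containing a target.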
To get the data.frame in the question:
df <- data.frame(
  docname = docnames(dfmat),
  Year = docvars(dfmat, c("Year")),
  contains_target = as.logical(rowSums(dfmat)),
  row.names = NULL
)
head(df)
##           docname Year contains_target
## 1 1789-Washington 1789            TRUE
## 2 1793-Washington 1793           FALSE
## 3      1797-Adams 1797            TRUE
## 4  1801-Jefferson 1801            TRUE
## 5  1805-Jefferson 1805           FALSE
## 6    1809-Madison 1809            TRUE
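If you would rather keep the column-indexing approach from the question, one possible sketch is to intersect the targets with featnames() before subsetting, so that only existing features are requested. The names dfmat_all, present and missing are illustrative and not from the original code, and the inaugural corpus stands in for the asker's data.

```r
library("quanteda")

# Full dfm without any prior token selection, standing in for the question's dfm
dfmat_all <- dfm(tokens(data_corpus_inaugural))

thetarget <- c("nuclear", "congress", "economy", "_not_a_feature_")

# Split the targets into those that exist as features and those that do not
present <- intersect(thetarget, featnames(dfmat_all))
missing <- setdiff(thetarget, featnames(dfmat_all))

# Indexing only by existing feature names avoids "Subscript out of bounds"
contains_target <- rowSums(dfmat_all[, present]) > 0
```

Inspecting missing shows which targets were dropped. One thing worth checking there: tokens_compound() joins phrases with "_" by default, so a compounded phrase would normally appear as example_words rather than "example words". Whether that applies here depends on the concatenator used in the question, which is not shown.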