How to get the percentage of documents that contain a feature(s)

I'm using this solution (get what percent of documents contain a feature - quanteda) to find the number of documents in my dataset that contain any one of a group of features. If a document contains at least one of the words, I want it to return TRUE.

I got it to work, but only some of the time, and I can't figure out why. Removing or adding words sometimes fixes it and sometimes doesn't. This is the code I used (the compound phrases were already joined with tokens_compound() before creating the dfm):

thetarget <- c("testing", "test", "example words", "example")

df <- data.frame(docname = docnames(dfm),
                 Year = docvars(dfm, c("Year")),
                 contains_target = rowSums(dfm[, thetarget]) > 0,
                 row.names = NULL)

And this is the error I sometimes get:

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'rowSums': 
Subscript out of bounds

TIA

edit (script to create table showing a year and number of documents containing any of the target words):

df2 <- df %>%
  mutate_if(is.logical, as.character) %>%
  filter(!str_detect(contains_target, "FALSE")) %>%
  group_by(Year) %>%
  summarise(n = n())

Answer:

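You are getting the error because some of the dfm objects you create do not contain every feature in thetarget. Indexing a dfm by a feature name it does not contain fails with "Subscript out of bounds", which is why adding or removing words changes whether the code runs.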
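If you want to keep the original indexing approach, one option is to check which targets are actually present before subsetting. The sketch below uses a made-up three-document corpus, and missing_feats and present are names introduced here for illustration. It also shows one possible cause in your case, assuming tokens_compound() was called with its default concatenator: the compounded phrase is stored as "example_words", so "example words" with a space is not a feature.

```r
library("quanteda")

# Toy corpus, invented for illustration only
toks <- tokens(c(d1 = "a test of example words",
                 d2 = "nothing relevant here",
                 d3 = "just a test"))

# With the default concatenator, the phrase becomes the single token "example_words"
toks <- tokens_compound(toks, phrase("example words"))
dfmat <- dfm(toks)

thetarget <- c("testing", "test", "example words", "example_words")

# Targets that would trigger "Subscript out of bounds" if used as column names
missing_feats <- setdiff(thetarget, featnames(dfmat))

# Index only by the targets that exist in this dfm
present <- intersect(thetarget, featnames(dfmat))
contains_target <- rowSums(dfmat[, present]) > 0
```

Printing missing_feats tells you which words to fix or drop, and contains_target is a logical vector with one entry per document.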

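Here's a way to avoid that, using tokens_select() and docfreq(). tokens_select() keeps only the target features that actually occur, so a target that is absent (like "_not_a_feature_" below) is simply ignored instead of causing an error: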

library("quanteda")
## Package version: 3.1.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

thetarget <- c("nuclear", "congress", "economy", "_not_a_feature_")

dfmat <- tokens(data_corpus_inaugural) %>%
  tokens_select(thetarget) %>%
  dfm()

docfreq(dfmat) / ndoc(dfmat)
##    economy   congress    nuclear 
## 0.52542373 0.49152542 0.08474576
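Note that docfreq() gives one proportion per feature. If you want the share of documents that contain at least one of the targets, a rough sketch on the same pattern is to test the row sums instead. The corpus below is made up for illustration, and per_feature and any_feature are names introduced here.

```r
library("quanteda")

# Toy corpus, invented for illustration only
toks <- tokens(c(d1 = "the economy and congress",
                 d2 = "nothing relevant",
                 d3 = "congress again",
                 d4 = "more filler text"))

thetarget <- c("congress", "economy", "_not_a_feature_")

# Keep only the target features; documents with no match stay in the dfm as empty rows
dfmat <- dfm(tokens_select(toks, thetarget))

# Proportion of documents containing each individual feature
per_feature <- docfreq(dfmat) / ndoc(dfmat)

# Proportion of documents containing at least one target feature
any_feature <- mean(rowSums(dfmat) > 0)
```

Multiply any_feature by 100 if you want a percentage.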

To get the data.frame in the question:

df <- data.frame(
  docname = docnames(dfmat),
  Year = docvars(dfmat, c("Year")),
  contains_target = as.logical(rowSums(dfmat)),
  row.names = NULL
)

head(df)
##           docname Year contains_target
## 1 1789-Washington 1789            TRUE
## 2 1793-Washington 1793           FALSE
## 3      1797-Adams 1797            TRUE
## 4  1801-Jefferson 1801            TRUE
## 5  1805-Jefferson 1805           FALSE
## 6    1809-Madison 1809            TRUE
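For the per-year table in your edit, there is no need to convert the logical column to character first. As a minimal base R sketch, assuming a df shaped like the one above (the rows below are made up, and by_year is a name introduced here), summing a logical vector counts the TRUE values and taking its mean gives the proportion:

```r
# Toy stand-in for the df built above; values are invented for illustration
df <- data.frame(
  docname = c("d1", "d2", "d3", "d4"),
  Year = c(2001, 2001, 2002, 2002),
  contains_target = c(TRUE, FALSE, TRUE, TRUE)
)

# Number of documents per year that contain any target feature
n_by_year <- tapply(df$contains_target, df$Year, sum)

# Percentage of documents per year that contain any target feature
pct_by_year <- tapply(df$contains_target, df$Year, function(x) 100 * mean(x))

by_year <- data.frame(
  Year = as.integer(names(n_by_year)),
  n = as.vector(n_by_year),
  pct = as.vector(pct_by_year)
)
```

The same idea works in a dplyr pipeline with group_by(Year) followed by summarise(n = sum(contains_target)).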