How do I create two subsets out of a corpus based on multiple keywords?

I am working with a large body of political speeches in quanteda and would like to create two subsets. The first should contain the documents that include one or more of a list of specific keywords (e.g. "migrant*", "migration*", "asylum*"). The second should contain the documents that include none of these terms (that is, the speeches that do not fall into the first subset).

Any input on this would be greatly appreciated. Thanks!

# First suggestion
> corp_labcon$criteria <- ifelse(stringi::stri_detect_regex(corp_labcon, pattern=paste0(regex_pattern), ignore_case = TRUE, collapse="|"), "yes", "no")

Warning messages:
1: In (function (case_insensitive, comments, dotall, dot_all = dotall,  :
  Unknown option to `stri_opts_regex`.
2: In stringi::stri_detect_regex(corp_labcon, pattern = paste0(regex_pattern),  :
  longer object length is not a multiple of shorter object length
  
> table(corp_labcon$criteria)

    no    yes 
556921   6139 

# Second suggestion
> corp_labcon$criteria <- ifelse(stringi::stri_detect_regex(corp_labcon, pattern = paste0(glob2rx(regex_pattern), collapse = "|")), "yes","no")

> table(corp_labcon$criteria)

    no 
563060

CodePudding user response：

You didn't give a reproducible example, but I will show how it can be done with quanteda and the built-in corpus data_corpus_inaugural. You can make use of the docvars attached to your corpus; adding one works just like adding a column to a data.frame.

stringi::stri_detect_regex checks each document for any of the search terms. If one is found, the criteria docvar is set to "yes", otherwise to "no". After that you can use corpus_subset to create two new corpora based on the criteria values. See the example code below.

library(quanteda)

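# Words used in the regex selection. stri_detect_regex matches substrings
# (case-sensitively by default), so plain stems are enough; in a regex a
# trailing * means "zero or more of the previous character", not a wildcard.
regex_pattern <- c("migrant", "migration", "asylum")

# add selection to corpus
data_corpus_inaugural$criteria <- ifelse(
  stringi::stri_detect_regex(data_corpus_inaugural,
                             pattern = paste0(regex_pattern, collapse = "|")),
  "yes", "no"
)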

# Check docvars and new criteria column
head(docvars(data_corpus_inaugural))
  Year  President FirstName                 Party criteria
1 1789 Washington    George                  none      yes
2 1793 Washington    George                  none       no
3 1797      Adams      John            Federalist       no
4 1801  Jefferson    Thomas Democratic-Republican       no
5 1805  Jefferson    Thomas Democratic-Republican       no
6 1809    Madison     James Democratic-Republican       no

# split corpus into segment 1 and 2
segment1 <- corpus_subset(data_corpus_inaugural, criteria == "yes")
segment2 <- corpus_subset(data_corpus_inaugural, criteria == "no")
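
If you would rather keep the glob-style keywords from the question ("migrant*" and so on), a possible alternative is to let quanteda do the matching on tokens instead of running a regex over the raw text. The following is only a sketch under a few assumptions: the names corp, hits, has_keyword, corp_with and corp_without are made up for illustration, and matching is done on whole tokens, so "immigrant" would not count as a hit for "migrant*".

```r
library(quanteda)

# glob-style keywords as given in the question
keywords <- c("migrant*", "migration*", "asylum*")

# work on a copy of the built-in corpus (illustrative name)
corp <- data_corpus_inaugural

# tokenise, then keep only the tokens matching the glob patterns;
# tokens_select() is case-insensitive by default
toks <- tokens(corp)
hits <- tokens_select(toks, pattern = keywords, valuetype = "glob")

# a document qualifies if at least one matching token is left
corp$has_keyword <- unname(ntoken(hits) > 0)

# split the corpus on the logical docvar
corp_with <- corpus_subset(corp, has_keyword)
corp_without <- corpus_subset(corp, !has_keyword)
```

Because both subsets come from the same logical docvar, every document ends up in exactly one of them, which you can check with ndoc(corp_with) + ndoc(corp_without) == ndoc(corp).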

CodePudding user response：

I'm not sure how your data is organised, but you could try the base R function grepl(). Assuming the data is a data frame in which each row holds one text, you could try:

words <- c("migrant", "migration", "asylum")
# grepl() accepts a single pattern, so join the words into one alternation
pattern <- paste(words, collapse = "|")

df[grepl(pattern, df$text), ]  # rows whose text contains any of the words
df[!grepl(pattern, df$text), ] # rows whose text contains none of the words

Your data is probably not structured like this, though! You should explain more clearly what your data looks like.
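
For what it's worth, here is a small self-contained sketch of the data frame idea using grepl(), which returns a logical vector and can therefore be negated with !. The data frame df, its id and text columns, and the example sentences are all invented for illustration. The \\b word-boundary anchor is an assumption on my part as well: it makes the stems match only at the start of a word, which is roughly what a glob such as "migrant*" would do.

```r
# toy data frame; the columns and sentences are made up for illustration
df <- data.frame(
  id = 1:4,
  text = c("Migration policy was debated.",
           "The budget passed without changes.",
           "Several asylum seekers testified.",
           "Immigrants were not mentioned here."),
  stringsAsFactors = FALSE
)

words <- c("migrant", "migration", "asylum")

# one alternation, anchored at a word start so "Immigrants" is not matched
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")")

# TRUE for rows whose text contains any of the words, ignoring case
hit <- grepl(pattern, df$text, ignore.case = TRUE, perl = TRUE)

with_words <- df[hit, ]      # rows 1 and 3
without_words <- df[!hit, ]  # rows 2 and 4
```

Dropping the \\b anchor would turn this into plain substring matching, in which case row 4 ("Immigrants") would move into with_words.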