I am working with a large body of political speeches in quanteda and would like to create two subsets. The first one should contain the documents that include one or more of a list of specific keywords (e.g. "migrant*", "migration*", "asylum*"). The second one should contain the documents that do not include any of these terms (the speeches that do not fall into the first subset).
Any input on this would be greatly appreciated. Thanks!
#first suggestion
> corp_labcon$criteria <- ifelse(stringi::stri_detect_regex(corp_labcon, pattern=paste0(regex_pattern), ignore_case = TRUE, collapse="|"), "yes", "no")
Warning messages:
1: In (function (case_insensitive, comments, dotall, dot_all = dotall, :
Unknown option to `stri_opts_regex`.
2: In stringi::stri_detect_regex(corp_labcon, pattern = paste0(regex_pattern), :
longer object length is not a multiple of shorter object length
> table(corp_labcon$criteria)
no yes
556921 6139
#Second suggestion
> corp_labcon$criteria <- ifelse(stringi::stri_detect_regex(corp_labcon, pattern = paste0(glob2rx(regex_pattern), collapse = "|")), "yes","no")
> table(corp_labcon$criteria)
no
563060
CodePudding user response:
You didn't give a reproducible example, but I will show how it can be done with quanteda and the built-in corpus data_corpus_inaugural. You can make use of the docvars attached to your corpus; adding one works just like adding a variable to a data.frame.
With stringi::stri_detect_regex you check whether any of the search terms occurs in each document. If so, the value in the criteria column is set to "yes", otherwise to "no". After that you can use corpus_subset to create two new corpora based on the criteria values. See the example code below.
library(quanteda)
# words used in regex selection
regex_pattern <- c("migrant*", "migration*", "asylum*")
# add selection to corpus
data_corpus_inaugural$criteria <- ifelse(
  stringi::stri_detect_regex(data_corpus_inaugural,
                             pattern = paste0(regex_pattern, collapse = "|")),
  "yes", "no")
# Check docvars and new criteria column
head(docvars(data_corpus_inaugural))
  Year  President FirstName                 Party criteria
1 1789 Washington    George                  none      yes
2 1793 Washington    George                  none       no
3 1797      Adams      John            Federalist       no
4 1801  Jefferson    Thomas Democratic-Republican       no
5 1805  Jefferson    Thomas Democratic-Republican       no
6 1809    Madison     James Democratic-Republican       no
# split corpus into segment 1 and 2
segment1 <- corpus_subset(data_corpus_inaugural, criteria == "yes")
segment2 <- corpus_subset(data_corpus_inaugural, criteria == "no")
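The keywords in the question are written as glob-style wildcards ("migrant*"), and the regex match above is case-sensitive. As an alternative sketch, quanteda can match glob patterns on tokens directly. The toy corpus and its texts below are hypothetical stand-ins for the speeches, and the approach assumes that word-level matching is what is wanted:

```r
library(quanteda)

# Toy corpus standing in for the speeches (hypothetical data)
corp <- corpus(c(d1 = "Migration policy must change.",
                 d2 = "The budget was approved.",
                 d3 = "Asylum seekers and migrants need support."))

glob_patterns <- c("migrant*", "migration*", "asylum*")

# Keep only the tokens that match one of the glob patterns.
# tokens_select() matches case-insensitively by default.
toks <- tokens_select(tokens(corp), pattern = glob_patterns,
                      valuetype = "glob")

# A document belongs to the first subset if at least one token survived
corp$criteria <- ifelse(unname(ntoken(toks)) > 0, "yes", "no")

segment1 <- corpus_subset(corp, criteria == "yes")
segment2 <- corpus_subset(corp, criteria == "no")
```

Because the match happens on whole tokens, "migrant*" here means a word starting with "migrant", which is closer to the intent of the wildcard than treating the same string as a regular expression.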
CodePudding user response:
Not sure how your data is organised, but you could try the function grepl(). Assuming the data is a data frame and each row holds one text, you could try:
words <- c("migrant", "migration", "asylum")
pattern <- paste(words, collapse = "|") # grepl() takes a single pattern, so join the words with "|"
df[grepl(pattern, df$text), ] # This will give you the rows that contain any of the words
df[!grepl(pattern, df$text), ] # This will give you the rows that contain none of the words
Your data is probably not structured like this, though. You should explain in more detail what your data looks like.
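For illustration, here is a self-contained sketch of the data frame approach. The data frame `df`, its `id` and `text` columns, and the example texts are all hypothetical. It uses grepl(), which returns one TRUE/FALSE value per row and can therefore be negated with `!`, and it sets ignore.case = TRUE so that "Migration" at the start of a sentence is also found:

```r
# Hypothetical data frame with one text per row
df <- data.frame(id = 1:3,
                 text = c("Migration policy must change.",
                          "The budget was approved.",
                          "Asylum seekers and migrants need support."),
                 stringsAsFactors = FALSE)

words <- c("migrant", "migration", "asylum")

# Join the words into one alternation pattern: "migrant|migration|asylum"
pattern <- paste(words, collapse = "|")

# One logical value per row: does the text contain any of the words?
has_word <- grepl(pattern, df$text, ignore.case = TRUE)

with_words <- df[has_word, ]      # rows that mention at least one word
without_words <- df[!has_word, ]  # rows that mention none of them
```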