I'm analyzing corporate meetings, and I want to measure when people in the meetings bring up certain topics. By "when" I mean the position of the word within the text.
For example, in three meetings, when do people bring up "unionizing" and other words in my dictionary?
df <- data.frame(text = c("we're meeting here today to talk about our earnings. we will also discuss unionizing efforts.", "hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.", "we will discuss unionizing tomorrow, today the focus is our Q3 earnings"))
dict <- c("unions", "strike", "unionizing")
Desired output:
text | count | word |
---|---|---|
we're meeting here today... | (location of word) | unionizing |
hi all, unionizing an... | (location of word) | unionizing |
hi all, unionizing an... | (location of word) | strike |
hi all, unionizing an... | (location of word) | unionizing |
we will discuss unionizing tomorrow... | (location of word) | unionizing |
I previously asked a question about finding the first time a word is used, and I tried to modify the code from that answer, but was unsuccessful.
CodePudding user response:
library(tidyverse)
library(tidytext)
df <- tibble(text = c("we're meeting here today to talk about our earnings. we will also discuss unionizing efforts.", "hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.", "we will discuss unionizing tomorrow, today the focus is our Q3 earnings"))
dict <- tibble(words = c("unions", "strike", "unionizing"))
df %>%
  unnest_tokens(output = "words",
                input = "text",
                drop = FALSE) %>%
  group_by(text) %>%
  mutate(word_count = row_number()) %>%
  ungroup() %>%
  inner_join(dict)
#> Joining, by = "words"
#> # A tibble: 5 × 3
#> text words word_count
#> <chr> <chr> <int>
#> 1 we're meeting here today to talk about our earnings. we will… unio… 14
#> 2 hi all, unionizing and the on-going strike is at the top of … unio… 3
#> 3 hi all, unionizing and the on-going strike is at the top of … stri… 8
#> 4 hi all, unionizing and the on-going strike is at the top of … unio… 17
#> 5 we will discuss unionizing tomorrow, today the focus is our … unio… 4
Created on 2022-05-30 by the reprex package (v2.0.1)
CodePudding user response:
Base R solution:
As a single record per observation:
# Create a regular expression to search with:
# search_regex => character scalar
search_regex <- paste0(
  dict,
  collapse = "|"
)
# For each observation, loop through and then flatten the result into a
# data.frame: res => data.frame
res <- do.call(
  rbind,
  lapply(
    df$text,
    function(x){
      # Create an ordered vector of the words in the observation
      # (split on runs of whitespace): vec_of_words => character vector
      vec_of_words <- unlist(
        strsplit(
          x,
          "\\s+"
        )
      )
      # Compute the indices where any of the search terms are found
      # in the vector: idx => integer vector
      idx <- which(
        grepl(
          search_regex,
          vec_of_words,
          ignore.case = TRUE
        )
      )
      # Create a data.frame containing the desired result:
      data.frame(
        # Assign the observation to the text vector:
        # text => character vector
        text = x,
        # Create a string containing the indices of the matching words:
        # count => character vector
        count = paste0(
          idx,
          collapse = ", "
        ),
        # Create a string of the matched words: words => character vector
        words = paste0(
          vec_of_words[idx],
          collapse = ", "
        ),
        row.names = NULL,
        stringsAsFactors = FALSE
      )
    }
  )
)
With a new record per matched word:
# Create a regular expression to search with:
# search_regex => character scalar
search_regex <- paste0(
  dict,
  collapse = "|"
)
# For each observation, loop through and then flatten the result into a
# data.frame: res => data.frame
res <- do.call(
  rbind,
  lapply(
    df$text,
    function(x){
      # Create an ordered vector of the words in the observation
      # (split on runs of whitespace): vec_of_words => character vector
      vec_of_words <- unlist(
        strsplit(
          x,
          "\\s+"
        )
      )
      # Compute the indices where any of the search terms are found
      # in the vector: idx => integer vector
      idx <- which(
        grepl(
          search_regex,
          vec_of_words,
          ignore.case = TRUE
        )
      )
      # Create a data.frame containing the desired result:
      data.frame(
        # Repeat the observation once per match, so that a text with
        # no matches yields zero rows instead of an error:
        # text => character vector
        text = rep(x, length(idx)),
        # Store the index of each matching word:
        # count => integer vector
        count = idx,
        # Create a vector of matched words: words => character vector
        words = vec_of_words[idx],
        row.names = NULL,
        stringsAsFactors = FALSE
      )
    }
  )
)
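One caveat with the regex approach above: grepl() matches substrings (so "strike" would also hit "strikes"), and splitting on whitespace leaves punctuation attached to the tokens, so the words column can contain values such as "unionizing,". As a minimal sketch, assuming exact whole-word matches are what is wanted, the tokens can be stripped of surrounding punctuation and compared with %in%. The helper name locate_terms is hypothetical and only used for illustration:

```r
df <- data.frame(
  text = c(
    "we're meeting here today to talk about our earnings. we will also discuss unionizing efforts.",
    "hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.",
    "we will discuss unionizing tomorrow, today the focus is our Q3 earnings"
  ),
  stringsAsFactors = FALSE
)
dict <- c("unions", "strike", "unionizing")

# Hypothetical helper: positions of dictionary terms within one string
locate_terms <- function(x, terms) {
  # Lower-case the text and split it on runs of whitespace
  tokens <- strsplit(tolower(x), "\\s+")[[1]]
  # Drop leading/trailing punctuation so "unionizing," equals "unionizing"
  tokens <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", tokens)
  # Keep only tokens that are exactly equal to a dictionary term
  idx <- which(tokens %in% terms)
  data.frame(
    text = rep(x, length(idx)),
    location = idx,
    word = tokens[idx],
    stringsAsFactors = FALSE
  )
}

res <- do.call(rbind, lapply(df$text, locate_terms, terms = dict))
res[, c("location", "word")]
```

Because the split is still on whitespace, "on-going" counts as one word here, so the locations agree with the other whitespace-based answers rather than with the tidytext one.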
CodePudding user response:
In base R, we can use the five lines of code below:
pat <- sprintf("\\b(%s)\\b",paste(dict, collapse = '|'))
words <- regmatches(df$text, gregexpr(pat, df$text))
loc <- Map(pmatch, words, strsplit(df$text, " "))
df1 <- stack(setNames(words, seq_along(words)))
transform(df1, location = unlist(loc), text = df$text[ind])
values ind location text
1 unionizing 1 14 we're meeting here today to talk about our earnings. we will also discuss unionizing efforts.
2 unionizing 2 3 hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.
3 strike 2 7 hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.
4 unionizing 2 16 hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.
5 unionizing 3 4 we will discuss unionizing tomorrow, today the focus is our Q3 earnings
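Note that the locations depend on how the text is split into words. Splitting on spaces keeps "on-going" as a single token, which is why strike is at position 7 here, whereas the tidytext answer reports 8. The sketch below reproduces both counts for the second text in base R. The character class used for word_tokens is only an assumed approximation of a word tokenizer, not tidytext's actual rule:

```r
txt <- "hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals."

# Whitespace tokens: "on-going" stays together
ws_tokens <- strsplit(txt, "\\s+")[[1]]

# Assumed word-level split: break on anything that is not a letter,
# digit, or apostrophe, so "on-going" becomes "on" and "going"
word_tokens <- strsplit(tolower(txt), "[^[:alnum:]']+")[[1]]
word_tokens <- word_tokens[nzchar(word_tokens)]

which(ws_tokens == "strike")        # 7
which(word_tokens == "strike")      # 8
which(ws_tokens == "unionizing")    # 3 16
which(word_tokens == "unionizing")  # 3 17
```

Whichever rule you pick, apply the same one to every meeting so that the positions are comparable.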
CodePudding user response:
Using quanteda:
First, tokenize the text and remove the punctuation; otherwise each punctuation mark is counted as a token. The advantage of using kwic is that you can easily see which words came before and after the word(s) you are looking for.
library(quanteda)
x <- kwic(tokens(df$text, remove_punct = TRUE), dict)
data.frame(x)
docname from to pre keyword post pattern
1 text1 14 14 earnings we will also discuss unionizing efforts unionizing
2 text2 3 3 hi all unionizing and the on-going strike is unionizing
3 text2 7 7 all unionizing and the on-going strike is at the top of strike
4 text2 16 16 top of our agenda because unionizing threatens our revenue goals unionizing
5 text3 4 4 we will discuss unionizing tomorrow today the focus is unionizing