I have a dataframe with speech data, like this:
df <- data.frame(
  id = 1:12,
  partcl = c("yeah yeah yeah absolutely", "well you know it 's", "oh well yeah that's right",
             "yeah I mean well oh", "well erm well Peter will be there", "well yeah well",
             "yes yes yes totally", "yeah yeah yeah yeah", "well well I did n't do it",
             "er well yeah that 's true", "oh hey where 's he gone?", "er"
  ))
and a vector of keywords called parts:
parts <- c("yeah", "oh", "no", "well", "mm", "yes", "so", "right", "er", "like")
What I need to do is keep only those rows that contain at least two distinct parts values. What I can do so far is keep the rows that contain at least two parts values, regardless of whether they're distinct or the same:
library(dplyr)
library(stringr)  # provides str_count()
df %>%
  filter(
    str_count(partcl, paste0("\\b(", paste0(parts, collapse = "|"), ")\\b")) > 1
  )
id partcl
1 1 yeah yeah yeah absolutely
2 3 oh well yeah that's right
3 4 yeah I mean well oh
4 5 well erm well Peter will be there
5 6 well yeah well
6 7 yes yes yes totally
7 8 yeah yeah yeah yeah
8 9 well well I did n't do it
9 10 er well yeah that 's true
How can I require that the matched parts values be distinct, so that the result is this:
id partcl
1 3 oh well yeah that's right
2 4 yeah I mean well oh
3 6 well yeah well
4 10 er well yeah that 's true
CodePudding user response:
Maybe this helps: extract the keywords with str_extract_all, then check them with n_distinct to filter the rows that have more than one unique keyword.
library(dplyr)
library(stringr)
library(purrr)
df %>%
  filter(map_lgl(str_extract_all(partcl,
                                 paste0("\\b(", paste0(parts, collapse = "|"), ")\\b")),
                 ~ n_distinct(.x) > 1))
-output
id partcl
1 3 oh well yeah that's right
2 4 yeah I mean well oh
3 6 well yeah well
4 10 er well yeah that 's true
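To make the two steps easier to follow, here is a small sketch that runs them separately on two of the strings from the question. The names pattern, x, matches and distinct_counts are only illustrative and do not appear in the answer above; sapply() stands in for map_lgl() purely to keep the example short.

```r
library(stringr)
library(dplyr)

parts <- c("yeah", "oh", "no", "well", "mm", "yes", "so", "right", "er", "like")
# Same pattern as in the answer: any keyword, as a whole word
pattern <- paste0("\\b(", paste0(parts, collapse = "|"), ")\\b")

# Two example strings taken from the question's data
x <- c("well yeah well", "yeah yeah yeah yeah")

# str_extract_all() returns a list with one character vector of matches per string
matches <- str_extract_all(x, pattern)

# n_distinct() counts unique values, so a repeated keyword is counted once
distinct_counts <- sapply(matches, n_distinct)
distinct_counts
#> [1] 2 1
```

Only the first string has more than one distinct keyword, so only that row would pass the n_distinct(.x) > 1 check.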
CodePudding user response:
You can iterate over parts with sapply() to check df$partcl for occurrences of each keyword. The paste0("\\b", x, "\\b") part ensures that only whole words are detected; otherwise "so" would also be found in "absolutely", for example. rowSums() then counts the matching keywords per row, giving a vector we can add to df, and dplyr::filter() keeps the desired rows.
library(dplyr)
df$distinct_parts_count <-
  sapply(parts, \(x) grepl(paste0("\\b", x, "\\b"), df$partcl)) |>
  rowSums()
df |>
  filter(distinct_parts_count >= 2)
#> id partcl distinct_parts_count
#> 1 3 oh well yeah that's right 4
#> 2 4 yeah I mean well oh 3
#> 3 6 well yeah well 2
#> 4 10 er well yeah that 's true 3
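As a rough illustration of why the word boundaries matter and of what rowSums() is summing, here is a base R sketch on a reduced keyword vector. The names small_parts, x, hits and counts are made up for this example and are not part of the answer above.

```r
# Without word boundaries, "so" matches inside "absolutely"
no_boundary   <- grepl("so", "yeah yeah yeah absolutely")
with_boundary <- grepl("\\bso\\b", "yeah yeah yeah absolutely")
no_boundary
#> [1] TRUE
with_boundary
#> [1] FALSE

# Reduced keyword vector and two example strings (illustrative only)
small_parts <- c("yeah", "well", "so")
x <- c("well yeah well", "yeah yeah yeah absolutely")

# One logical column per keyword, one row per string
hits <- sapply(small_parts, function(p) grepl(paste0("\\b", p, "\\b"), x))

# Each keyword contributes at most 1 per row, so the row sum
# is the number of distinct keywords found in that string
counts <- rowSums(hits)
counts
#> [1] 2 1
```

One caveat with this approach: sapply() only returns a matrix when there is more than one string to check. With a single-row data frame it returns a plain vector, and rowSums() would then fail.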