Filter rows on the condition that at least two distinct key words must be present

I have a dataframe with speech data, like this:

df <- data.frame(
  id = 1:12,
  partcl = c("yeah yeah yeah absolutely", "well you know it 's", "oh well yeah that's right", 
             "yeah I mean well oh", "well erm well Peter will be there", "well yeah well", 
             "yes yes yes totally", "yeah yeah yeah yeah", "well well I did n't do it", 
             "er well yeah that 's true", "oh hey where 's he gone?", "er"
))

and a vector with key words called parts:

parts <- c("yeah", "oh", "no", "well", "mm", "yes", "so", "right", "er", "like")

I need to keep only those rows that contain at least two distinct parts values. So far I can only keep rows that contain at least two parts values, regardless of whether they are distinct or the same:

library(dplyr)
library(stringr)  # str_count() comes from stringr
df %>%
  filter(
    str_count(partcl, paste0("\\b(", paste0(parts, collapse = "|"), ")\\b")) > 1
  )
  id                            partcl
1  1         yeah yeah yeah absolutely
2  3         oh well yeah that's right
3  4               yeah I mean well oh
4  5 well erm well Peter will be there
5  6                    well yeah well
6  7               yes yes yes totally
7  8               yeah yeah yeah yeah
8  9         well well I did n't do it
9 10         er well yeah that 's true

How can I require that the matched parts be distinct, so that the result is this:

  id                            partcl
1  3         oh well yeah that's right
2  4               yeah I mean well oh
3  6                    well yeah well
4 10         er well yeah that 's true

CodePudding user response:

Maybe this helps: extract the key words with str_extract_all(), then use n_distinct() to keep only the rows that have more than one unique key word.

library(dplyr)
library(stringr)
library(purrr)
df %>% 
  filter(map_lgl(str_extract_all(partcl, 
    paste0("\\b(", paste0(parts, collapse = "|"), ")\\b")), 
    ~  n_distinct(.x) > 1))

-output

 id                    partcl
1  3 oh well yeah that's right
2  4       yeah I mean well oh
3  6            well yeah well
4 10 er well yeah that 's true
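
The same extract-then-count-unique idea can also be sketched in base R, without stringr or purrr. This is only a minimal sketch: it rebuilds df and parts from the question so that it runs on its own, and the names pattern, hits, n_unique and result are illustrative, not part of the answer above.

```r
# Data from the question (stringsAsFactors = FALSE keeps partcl as character on older R)
df <- data.frame(
  id = 1:12,
  partcl = c("yeah yeah yeah absolutely", "well you know it 's", "oh well yeah that's right",
             "yeah I mean well oh", "well erm well Peter will be there", "well yeah well",
             "yes yes yes totally", "yeah yeah yeah yeah", "well well I did n't do it",
             "er well yeah that 's true", "oh hey where 's he gone?", "er"),
  stringsAsFactors = FALSE
)
parts <- c("yeah", "oh", "no", "well", "mm", "yes", "so", "right", "er", "like")

# One alternation pattern with a word boundary on each side
pattern <- paste0("\\b(", paste(parts, collapse = "|"), ")\\b")

# gregexpr() locates every match per string; regmatches() extracts them as a list
hits <- regmatches(df$partcl, gregexpr(pattern, df$partcl, perl = TRUE))

# Number of distinct key words found in each row
n_unique <- lengths(lapply(hits, unique))

# Keep rows with at least two distinct key words
result <- df[n_unique > 1, ]
result
```

Rows with no match yield character(0) in hits, so they count as zero distinct key words and are dropped by the same comparison.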

CodePudding user response:

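You can iterate over parts with sapply() and check df$partcl for each key word with grepl(). Wrapping each key word as paste0("\\b", x, "\\b") ensures that only whole words are detected; otherwise "so", for example, would also be found in "absolutely". The result is a logical matrix with one row per utterance and one column per key word, so rowSums() gives the number of distinct key words in each row. We can add that vector to df and then use dplyr::filter() to keep the desired rows. Note that the \(x) lambda and the |> pipe require R 4.1 or later.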

library(dplyr)

df$distinct_parts_count <- 
  sapply(parts, \(x) grepl(paste0("\\b", x, "\\b"), df$partcl)) |> 
  rowSums()

df |> 
  filter(distinct_parts_count >= 2)
#>   id                    partcl distinct_parts_count
#> 1  3 oh well yeah that's right                    4
#> 2  4       yeah I mean well oh                    3
#> 3  6            well yeah well                    2
#> 4 10 er well yeah that 's true                    3
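
To see why the word boundaries matter and what rowSums() is summing, here is a small sketch on two of the utterances. The names x, unbounded, bounded, m and counts are illustrative only, and function(p) is used instead of \(x) so that it also runs on R versions before 4.1.

```r
parts <- c("yeah", "oh", "no", "well", "mm", "yes", "so", "right", "er", "like")
x <- c("yeah yeah yeah absolutely", "er well yeah that 's true")

# Without boundaries, "so" is found inside "absolutely"
unbounded <- grepl("so", x)

# With boundaries, only the standalone word "so" would count
bounded <- grepl("\\bso\\b", x)

# Logical matrix: one row per string, one column per key word
m <- sapply(parts, function(p) grepl(paste0("\\b", p, "\\b"), x))

# Each TRUE is one distinct key word present, so row sums are distinct counts
counts <- rowSums(m)
counts
```

One caveat with this approach: sapply() simplifies to a matrix only when x has more than one element. With a single-row data frame it returns a plain vector, and rowSums() would then fail.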