str_detect: how to test for hundreds of substrings?

I have a very large dataset of names of both sexes, and a second dataset with the most common girl names. My goal is to filter out every name that contains a common girl name. How can I do this with the tidyverse?

This is my best try on simulated data.

library(tidyverse)

names = tibble(name = c("Mark", "Roger", "May", "Emma", "Angelina", "Emma-Anna"))

girl_names = tibble(name = c("May", "Emma", "Angelina"))

names %>% 
  filter(!str_detect(name, ??))

CodePudding user response：

Try this:

library(tidyverse)
names %>% 
  filter(str_detect(name, paste0("^(", paste0(girl_names$name, collapse = "|"), ")$")))
# A tibble: 3 × 1
  name    
  <chr>   
1 May     
2 Emma    
3 Angelina

The important point here is this:

paste0("^(", paste0(girl_names$name, collapse = "|"), ")$")
[1] "^(May|Emma|Angelina)$"

By including the anchors ^ and $, we assert that the string starts and ends exactly where the girl's name does, which avoids matching compound names such as Emma-Anna.
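
Since the question asks to filter the girl names out rather than keep them, the same anchored pattern can simply be negated. This is a minimal sketch, assuming the names and girl_names tibbles from the question; pattern_exact and kept are illustrative names that do not appear in the original answer.

```r
library(dplyr)
library(stringr)

# Sample data from the question
names <- tibble(name = c("Mark", "Roger", "May", "Emma", "Angelina", "Emma-Anna"))
girl_names <- tibble(name = c("May", "Emma", "Angelina"))

# Anchored alternation: matches only when the whole string is a listed name
pattern_exact <- paste0("^(", paste0(girl_names$name, collapse = "|"), ")$")

# Negate the match to drop the listed names instead of keeping them
kept <- names %>%
  filter(!str_detect(name, pattern_exact))

kept$name
```

On this sample data, kept should contain Mark, Roger and Emma-Anna. The compound name survives because the anchors require the whole string to be a listed name.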

EDIT

If compound names are also to be matched:

names %>% 
  filter(str_detect(name, paste0(girl_names$name, collapse = "|")))
# A tibble: 4 × 1
  name     
  <chr>    
1 May      
2 Emma     
3 Angelina 
4 Emma-Anna
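
One caveat with the unanchored pattern is that it also matches a listed name embedded in a longer word. The sketch below shows one possible way to restrict matches to whole words with \\b word boundaries. The candidates vector, including the surname Mayer, is made up for illustration, and the variable names are assumptions.

```r
library(stringr)

girl_names_vec <- c("May", "Emma", "Angelina")
candidates <- c("May", "Mayer", "Emma-Anna", "Mark")  # hypothetical test values

# Plain alternation: matches the name anywhere inside the string
pattern_any <- paste0(girl_names_vec, collapse = "|")

# Same alternation wrapped in word boundaries: the name must be a whole word
pattern_word <- paste0("\\b(", pattern_any, ")\\b")

hits_any  <- str_detect(candidates, pattern_any)   # "Mayer" matches: it contains "May"
hits_word <- str_detect(candidates, pattern_word)  # "Mayer" is rejected, "Emma-Anna" still matches
```

If the list of names could contain regex metacharacters such as . or (, they would need to be escaped before being pasted into the pattern.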

CodePudding user response：

If girl_names and names are data.frames, then we can use dplyr::semi_join:

library(dplyr) 

names = tibble(name = c("Mark", "Roger", "May", "Emma", "Angelina", "Emma-Anna"))

girl_names = tibble(name = c("May", "Emma", "Angelina"))

names %>% 
  semi_join(girl_names, by = "name")

#> # A tibble: 3 x 1
#>   name    
#>   <chr>   
#> 1 May     
#> 2 Emma    
#> 3 Angelina

Created on 2022-08-04 by the reprex package (v2.0.1)
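
semi_join keeps the rows that have a match. To remove the common girl names, as the question asks, its counterpart dplyr::anti_join keeps only the rows without a match. A short sketch on the same sample data; the result name non_girl_names is illustrative.

```r
library(dplyr)

names <- tibble(name = c("Mark", "Roger", "May", "Emma", "Angelina", "Emma-Anna"))
girl_names <- tibble(name = c("May", "Emma", "Angelina"))

# anti_join() returns the rows of `names` with no exact match in `girl_names`
non_girl_names <- names %>%
  anti_join(girl_names, by = "name")

non_girl_names$name
```

Like semi_join, this compares whole values, so Emma-Anna is not treated as a match for Emma and stays in the result.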

If, however, girl_names is a vector, we can use match or its wrapper %in%:

library(dplyr)

girl_names_vec = c("May", "Emma", "Angelina")

names %>% 
  filter(name %in% girl_names_vec)
#> # A tibble: 3 x 1
#>   name    
#>   <chr>   
#> 1 May     
#> 2 Emma    
#> 3 Angelina

If you want to match any part of the name, and so account for compound names, a dplyr::rowwise() approach with strsplit could look like this:

library(dplyr)

girl_names <- tibble(name = c("May", "Emma", "Angelina"))

names %>% 
  mutate(name2 = strsplit(name, "[_[:space:]-]")) %>%  # split on "_", whitespace, or "-"
  rowwise() %>% 
  filter(any(unlist(name2) %in% girl_names$name)) %>% 
  select(!name2)

#> # A tibble: 4 x 1
#> # Rowwise: 
#>   name     
#>   <chr>    
#> 1 May      
#> 2 Emma     
#> 3 Angelina 
#> 4 Emma-Anna

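
On a very large dataset, rowwise() can be slow. A possible alternative is to split all names at once and test each piece with vapply. This is a sketch under the assumption that compound names are separated by "-", "_" or whitespace; parts, has_girl_part and names_without_girls are illustrative names.

```r
library(dplyr)
library(stringr)

names <- tibble(name = c("Mark", "Roger", "May", "Emma", "Angelina", "Emma-Anna"))
girl_names <- tibble(name = c("May", "Emma", "Angelina"))

# Split every name on "-", "_" or whitespace; str_split() returns a list of character vectors
parts <- str_split(names$name, "[-_\\s]+")

# TRUE when at least one part of the name is a listed girl name
has_girl_part <- vapply(parts, function(p) any(p %in% girl_names$name), logical(1))

# Drop the matching rows, which is what the question asks for
names_without_girls <- names %>%
  filter(!has_girl_part)

names_without_girls$name
```

Because each part is compared with %in% rather than a regular expression, no pattern has to be built and no escaping is needed, however many names the list contains.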