I have a very large dataset containing names of both sexes. Another dataset contains the most common girl names. My goal is to filter out all names that contain a common girl name. How can I do this with the tidyverse?
This is my best attempt on simulated data:
library(tidyverse)
names = tibble(name = c("Mark", "Roger", "May", "Emma", "Angelina", "Emma-Anna"))
girl_names = tibble(name = c("May", "Emma", "Angelina"))
names %>%
filter(!str_detect(name, ??))
CodePudding user response:
Try this:
library(tidyverse)
names %>%
  filter(str_detect(name, paste0("^(", paste0(girl_names$name, collapse = "|"), ")$")))
name
<chr>
1 May
2 Emma
3 Angelina
The important point here is this:
paste0("^(", paste0(girl_names$name, collapse = "|"), ")$")
[1] "^(May|Emma|Angelina)$"
By including the anchors ^ and $, we require the whole string to be exactly one of the girls' names, which avoids also matching compound names such as Emma-Anna.
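The question's own attempt uses !str_detect, which suggests the matching names should be dropped rather than kept. A minimal sketch of that reading, reusing the simulated data from the question; the variable names pattern and kept are introduced here only for illustration:

```r
library(dplyr)
library(stringr)

names <- tibble(name = c("Mark", "Roger", "May", "Emma", "Angelina", "Emma-Anna"))
girl_names <- tibble(name = c("May", "Emma", "Angelina"))

# Same anchored pattern as above: "^(May|Emma|Angelina)$"
pattern <- paste0("^(", paste0(girl_names$name, collapse = "|"), ")$")

# Negating the test drops exact matches and keeps everything else
kept <- names %>%
  filter(!str_detect(name, pattern))

kept$name
# "Mark", "Roger", "Emma-Anna"
```

With the anchors in place, Emma-Anna survives the filter because it is not an exact match; dropping the anchors would remove it as well.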
EDIT
If compound names are also to be matched:
names %>%
filter(str_detect(name, paste0(girl_names$name, collapse = "|")))
# A tibble: 4 × 1
name
<chr>
1 May
2 Emma
3 Angelina
4 Emma-Anna
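One caveat with the unanchored pattern is that it matches anywhere inside a string, so a name that merely starts with a girl name would also be kept. A possible middle ground, sketched here with word boundaries (\\b); the names "Emmanuel" and "Mayra" are hypothetical additions to the simulated data, used only to show the difference:

```r
library(dplyr)
library(stringr)

# "Emmanuel" and "Mayra" are hypothetical examples, not part of the original data
names <- tibble(name = c("Mark", "May", "Emma-Anna", "Emmanuel", "Mayra"))
girl_names <- tibble(name = c("May", "Emma", "Angelina"))

alternatives <- paste0(girl_names$name, collapse = "|")

# Plain alternation matches anywhere inside the string
loose <- names %>%
  filter(str_detect(name, alternatives))

# \\b requires a word boundary on both sides of the matched name
bounded <- names %>%
  filter(str_detect(name, paste0("\\b(", alternatives, ")\\b")))

loose$name    # "May", "Emma-Anna", "Emmanuel", "Mayra"
bounded$name  # "May", "Emma-Anna"
```

Whether names such as Emmanuel should count as a match depends on the data, so this is only one way to tighten the pattern.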
CodePudding user response:
If girl_names and names are data.frames, then we can use dplyr::semi_join:
library(dplyr)
names = tibble(name = c("Mark", "Roger", "May", "Emma", "Angelina", "Emma-Anna"))
girl_names = tibble(name = c("May", "Emma", "Angelina"))
names %>%
semi_join(girl_names, by = "name")
#> # A tibble: 3 x 1
#> name
#> <chr>
#> 1 May
#> 2 Emma
#> 3 Angelina
Created on 2022-08-04 by the reprex package (v2.0.1)
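If the aim is to remove the common girl names rather than keep them, the complementary filtering join is dplyr::anti_join. A minimal sketch on the same simulated data; the result name not_girl is only illustrative:

```r
library(dplyr)

names <- tibble(name = c("Mark", "Roger", "May", "Emma", "Angelina", "Emma-Anna"))
girl_names <- tibble(name = c("May", "Emma", "Angelina"))

# anti_join keeps the rows of `names` that have no exact match in `girl_names`
not_girl <- names %>%
  anti_join(girl_names, by = "name")

not_girl$name
# "Mark", "Roger", "Emma-Anna"
```

Like semi_join, this compares whole values only, so the compound name Emma-Anna is not treated as a match.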
If, however, girl_names is a vector, we can use match or its wrapper %in%:
library(dplyr)
girl_names_vec = c("May", "Emma", "Angelina")
names %>%
filter(name %in% girl_names_vec)
#> # A tibble: 3 x 1
#> name
#> <chr>
#> 1 May
#> 2 Emma
#> 3 Angelina
If you want to detect any part of the name and account for compound names, then a dplyr::rowwise() approach with strsplit could look like this:
library(dplyr)
girl_names <- tibble(name = c("May", "Emma", "Angelina"))
names %>%
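# split on whitespace, underscores and hyphens; [:space:] is a POSIX class that base R's default regex engine supports
mutate(name2 = strsplit(name, "[[:space:]_-]")) %>%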
rowwise() %>%
filter(any(unlist(name2) %in% girl_names$name)) %>%
select(!name2)
#> # A tibble: 4 x 1
#> # Rowwise:
#> name
#> <chr>
#> 1 May
#> 2 Emma
#> 3 Angelina
#> 4 Emma-Anna
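Since the question mentions a very large dataset, and rowwise() may be slow there, the same split-and-check idea can also be written without row-wise grouping. This is a sketch that assumes purrr is available (it is part of the tidyverse); the column name parts and the result name are illustrative only:

```r
library(dplyr)
library(purrr)

names <- tibble(name = c("Mark", "Roger", "May", "Emma", "Angelina", "Emma-Anna"))
girl_names <- tibble(name = c("May", "Emma", "Angelina"))

result <- names %>%
  # list column holding the pieces of each name, split on whitespace, "_" or "-"
  mutate(parts = strsplit(name, "[[:space:]_-]")) %>%
  # keep a row if any of its pieces is exactly a common girl name
  filter(map_lgl(parts, ~ any(.x %in% girl_names$name))) %>%
  select(!parts)

result$name
# "May", "Emma", "Angelina", "Emma-Anna"
```

The result is an ordinary tibble, so no ungroup() is needed afterwards.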