How to filter a data frame where only part of the string matches any item in a vector


I want to filter my data frame based on whether one column contains text that appears in a vector. The string in each cell of the column is quite long, and I only need the vector item to appear within the string.

I can do this for a single reference, e.g.:

library(dplyr) # for filter(), %>% and the starwars dataset
starwars %>%
  filter(grepl("at", name))

But what if I want to use a vector of references?

attributes = c("at", "oo", "un")

CodePudding user response:

Base R option:

library(dplyr) # for starwars dataset
attributes = c("at", "oo", "un")
starwars[grepl(paste(attributes, collapse="|"), starwars$name),]
#> # A tibble: 15 × 14
#>    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#>  1 Beru White…    165    75 brown   light   blue         47 fema… femin… Tatooi…
#>  2 Palpatine      170    75 grey    pale    yellow       82 male  mascu… Naboo  
#>  3 Nien Nunb      160    68 none    grey    black        NA male  mascu… Sullust
#>  4 Nute Gunray    191    90 none    mottle… red          NA male  mascu… Cato N…
#>  5 Roos Tarpa…    224    82 none    grey    orange       NA male  mascu… Naboo  
#>  6 Watto          137    NA black   blue, … yellow       NA male  mascu… Toydar…
#>  7 Bib Fortuna    180    NA none    pale    pink         NA male  mascu… Ryloth 
#>  8 Ki-Adi-Mun…    198    82 white   pale    yellow       92 male  mascu… Cerea  
#>  9 Yarael Poof    264    NA none    white   yellow       NA male  mascu… Quermia
#> 10 Plo Koon       188    80 none    orange  black        22 male  mascu… Dorin  
#> 11 Dooku          193    80 white   fair    brown       102 male  mascu… Serenno
#> 12 Taun We        213    NA none    grey    black        NA fema… femin… Kamino 
#> 13 Ratts Tyer…     79    15 none    grey, … unknown      NA male  mascu… Aleen …
#> 14 Wat Tambor     193    48 none    green,… unknown      NA male  mascu… Skako  
#> 15 Sly Moore      178    48 none    pale    white        NA <NA>  <NA>   Umbara 
#> # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
#> #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
#> #   ³​eye_color, ⁴​birth_year, ⁵​homeworld

Created on 2022-09-09 with reprex v2.0.2

CodePudding user response:

Use paste() with collapse = "|" to combine the items into a single regular expression that matches any of them.

attributes = c("at", "oo", "un")
#paste(attributes, collapse = "|")
#[1] "at|oo|un"

starwars %>%
  filter(grepl(paste(attributes, collapse = "|"), name))

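Another way, if you don't want to use paste (this needs stringr for str_detect, and R 4.1 or later for the \(x) function shorthand):

library(stringr) # for str_detect()
starwars %>%
  filter(sapply(name, \(x) any(sapply(attributes, str_detect, string = x))))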
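
One caveat worth sketching: both approaches above treat each item as a regular expression, so an item containing a character such as "." or "(" may match more than intended. The example below is a minimal base R sketch of literal (fixed) matching instead. The helper name matches_any_fixed and the toy vectors are made up for illustration and are not part of the original answers.

```r
# Hypothetical helper: TRUE where any item of `patterns` occurs literally in `x`.
# fixed = TRUE makes grepl() treat each pattern as plain text, not a regex.
matches_any_fixed <- function(x, patterns) {
  hits <- lapply(patterns, grepl, x = x, fixed = TRUE) # one logical vector per pattern
  Reduce(`|`, hits)                                    # element-wise OR across patterns
}

# Toy data (assumed for illustration): one pattern contains a regex metacharacter
x        <- c("a.b", "axb", "foo")
patterns <- c("a.b", "oo")

# Regex alternation: "." matches any character, so "axb" is also a hit
regex_hits <- grepl(paste(patterns, collapse = "|"), x)

# Literal matching: only the exact text "a.b" or "oo" counts
literal_hits <- matches_any_fixed(x, patterns)

print(regex_hits)
print(literal_hits)
```

With dplyr loaded, the same helper should drop into the pipeline as starwars %>% filter(matches_any_fixed(name, attributes)). For plain alphabetic items such as "at", "oo" and "un", it selects the same rows as the paste version.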