Check whether a string appears in another in R

Time:01-12

I've got a tibble containing sentences like this:

df <- tibble(sentences = c("Bob is looking for something", "Adriana has an umbrella", "Michael is looking at..."))

And another containing a long list of names:

names <- tibble(names = c("Bob", "Mary", "Michael", "John", "Etc."))

I would like to check whether each sentence contains a name from the list, and add a column indicating the result, to get the following tibble:

wanted_df <- tibble(sentences = c("Bob is looking for something", "Adriana has an umbrella", "Michael is looking at..."), check = c(TRUE, FALSE, TRUE))

So far I've tried this, with no success:

df <- df %>%
mutate(check = grepl(pattern = names$names, x = df$sentences, fixed = TRUE))

And also:

check <- str_detect(names$names %in% df$sentences)

Thanks a lot for any help ;)

CodePudding user response:

Combine the names into a single regular expression, with the alternatives separated by "|", and pass it to grepl:

df %>% 
  mutate(check = grepl(paste(names$names, collapse = "|"), sentences))

# A tibble: 3 × 2
  sentences                    check
  <chr>                        <lgl>
1 Bob is looking for something TRUE 
2 Adriana has an umbrella      FALSE
3 Michael is looking at...     TRUE 
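
One caveat with a plain alternation is that it matches substrings ("Bob" inside "Bobby") and treats characters such as "." as regex metacharacters. A minimal sketch of a stricter pattern follows, assuming whole-word, literal matching is what is wanted; the extra sentence and the variable names (name_list, loose, strict) are illustrative and not from the question.

```r
# Illustrative data (plain character vectors, so only base R is needed).
# "Bobby went home" is a hypothetical extra sentence added for this sketch.
sentences <- c("Bob is looking for something",
               "Adriana has an umbrella",
               "Michael is looking at...",
               "Bobby went home")
name_list <- c("Bob", "Mary", "Michael", "John", "Etc.")

# Plain alternation: "Bob" also matches inside "Bobby".
loose <- grepl(paste(name_list, collapse = "|"), sentences)

# Stricter pattern: \Q...\E quotes each name literally (PCRE syntax), and the
# lookarounds require that no word character touches the match on either side.
quoted  <- paste0("\\Q", name_list, "\\E", collapse = "|")
pattern <- paste0("(?<!\\w)(?:", quoted, ")(?!\\w)")
strict  <- grepl(pattern, sentences, perl = TRUE)

loose   # TRUE FALSE TRUE TRUE
strict  # TRUE FALSE TRUE FALSE
```

Note that perl = TRUE is required here, since \Q...\E and lookarounds are not part of R's default (extended) regex syntax.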

CodePudding user response:

Here is a base R solution.

inx <- sapply(names$names, \(pat) grepl(pat, df$sentences))
inx
#>        Bob  Mary Michael  John  Etc.
#> [1,]  TRUE FALSE   FALSE FALSE FALSE
#> [2,] FALSE FALSE   FALSE FALSE FALSE
#> [3,] FALSE FALSE    TRUE FALSE FALSE

inx <- rowSums(inx) > 0L
df$check <- inx
df
#> # A tibble: 3 × 2
#>   sentences                    check
#>   <chr>                        <lgl>
#> 1 Bob is looking for something TRUE 
#> 2 Adriana has an umbrella      FALSE
#> 3 Michael is looking at...     TRUE

Created on 2023-01-11 with reprex v2.0.2
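
In the code above each name is interpreted as a regular expression, so the "." in "Etc." matches any character. If the names are meant literally, one possible variant passes fixed = TRUE to grepl. This is only a sketch: the third sentence is a hypothetical example chosen to show the difference, and Reduce is used here in place of rowSums.

```r
sentences <- c("Bob is looking for something",
               "Adriana has an umbrella",
               "Etcetera is a word")   # hypothetical sentence for this sketch
name_list <- c("Bob", "Mary", "Michael", "John", "Etc.")

# As regular expressions: the "." in "Etc." matches any character ("Etce").
as_regex <- Reduce(`|`, lapply(name_list, grepl, x = sentences))

# As literal strings: fixed = TRUE switches regex interpretation off.
as_literal <- Reduce(`|`, lapply(name_list, grepl, x = sentences, fixed = TRUE))

as_regex    # TRUE FALSE TRUE
as_literal  # TRUE FALSE FALSE
```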

CodePudding user response:

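grep and family expect pattern= to be length 1; if a longer vector is passed, only the first element is used (with a warning). Similarly, str_detect needs a character vector plus a pattern (of length 1 or the same length as the strings), not a single logical vector, so neither attempt works as-is.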

We have a couple of options:

  • sapply on the names (into a matrix) and see if each row has one or more matches:

    df %>%
      mutate(check = rowSums(sapply(names$names, grepl, sentences)) > 0)
    # # A tibble: 3 × 2
    #   sentences                    check
    #   <chr>                        <lgl>
    # 1 Bob is looking for something TRUE 
    # 2 Adriana has an umbrella      FALSE
    # 3 Michael is looking at...     TRUE 
    

    (I now see this is in RuiBarradas's answer.)

  • Do a fuzzy-join on the data using fuzzyjoin:

    df %>%
      fuzzyjoin::regex_left_join(names, by = c(sentences = "names")) %>%
      mutate(check = !is.na(names))
    # # A tibble: 3 × 3
    #   sentences                    names   check
    #   <chr>                        <chr>   <lgl>
    # 1 Bob is looking for something Bob     TRUE 
    # 2 Adriana has an umbrella      NA      FALSE
    # 3 Michael is looking at...     Michael TRUE 
    

    This method has the advantage that it tells you which pattern (in names) made the match. Note that a sentence matching several names will appear once per matching name.
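
If installing fuzzyjoin is not an option, the "which name matched" information can also be recovered from the sapply matrix in base R. The following is a rough sketch under that assumption; the fourth sentence and the names hits, matched and result are made up for illustration, and matching is literal (fixed = TRUE).

```r
sentences <- c("Bob is looking for something",
               "Adriana has an umbrella",
               "Michael is looking at...",
               "Bob and Mary are talking")  # hypothetical multi-name sentence
name_list <- c("Bob", "Mary", "Michael", "John")

# One column per name, one row per sentence (literal matching).
hits <- sapply(name_list, function(nm) grepl(nm, sentences, fixed = TRUE))

# Collapse the matching column names per row; no match gives "".
matched <- apply(hits, 1, function(row) paste(colnames(hits)[row], collapse = ", "))

result <- data.frame(sentences, matched, check = matched != "")
result$matched  # "Bob" "" "Michael" "Bob, Mary"
```

Unlike the join, this keeps one row per sentence even when several names match.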

CodePudding user response:

Maybe we can try adist with colSums, as below

df %>%
  mutate(check = colSums(adist(names$names, sentences, fixed = FALSE) == 0) > 0)

which gives

# A tibble: 3 × 2
  sentences                    check
  <chr>                        <lgl>
1 Bob is looking for something TRUE
2 Adriana has an umbrella      FALSE
3 Michael is looking at...     TRUE
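
To see why comparing with 0 works, it may help to look at the distance matrix itself. A small sketch, using plain vectors and a shortened name list (both are assumptions for illustration): with fixed = FALSE, adist treats the names as regular expressions and measures the edit distance to the best-matching substring of each sentence, so a distance of 0 should mean the name occurs unchanged.

```r
sentences <- c("Bob is looking for something",
               "Adriana has an umbrella",
               "Michael is looking at...")
name_list <- c("Bob", "Mary", "Michael")

# Rows correspond to names, columns to sentences. With fixed = FALSE the names
# are treated as regular expressions and matched against substrings.
d <- adist(name_list, sentences, fixed = FALSE)

# A distance of 0 means the name occurs in the sentence without any edits.
check <- colSums(d == 0) > 0
check
```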