Check whether a string appears in another in R

I've got a tibble containing sentences like this:

df <- tibble(sentences = c("Bob is looking for something", "Adriana has an umbrella", "Michael is looking at..."))

And another containing a long list of names:

names <- tibble(names = c("Bob", "Mary", "Michael", "John", "Etc."))

I would like to check whether each sentence contains a name from the list, and add a column indicating the result, to get the following tibble:

wanted_df <- tibble(sentences = c("Bob is looking for something", "Adriana has an umbrella", "Michael is looking at..."), check = c(TRUE, FALSE, TRUE))

So far I've tried this, with no success:

df <- df %>%
mutate(check = grepl(pattern = names$names, x = df$sentences, fixed = TRUE))

And also:

check <- str_detect(names$names %in% df$sentences)

Thanks a lot for any help ;)

CodePudding user response：

You should combine the names into a single regular expression for grepl:

df %>% 
  mutate(check = grepl(paste(names$names, collapse = "|"), sentences))

# A tibble: 3 × 2
  sentences                    check
  <chr>                        <lgl>
1 Bob is looking for something TRUE 
2 Adriana has an umbrella      FALSE
3 Michael is looking at...     TRUE
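
One caveat with a plain alternation is that it matches substrings, so "Bob" is also found inside "Bobby". If whole-word matches are wanted, the pattern can be wrapped in word boundaries. The following is a minimal base R sketch, not part of the original answer; the `sentences` and `name_list` vectors are made-up inputs chosen to show the difference.

```r
# Hypothetical inputs: "Bobby" contains "Bob" as a substring.
sentences <- c("Bobby is looking for something",
               "Bob has an umbrella",
               "Adriana has an umbrella")
name_list <- c("Bob", "Mary", "Michael", "John")

alternation <- paste(name_list, collapse = "|")

# Plain alternation: matches "Bob" anywhere, including inside "Bobby".
check_substring <- grepl(alternation, sentences)

# Wrapped in \b ... \b: only whole-word matches count.
bounded <- paste0("\\b(?:", alternation, ")\\b")
check_whole_word <- grepl(bounded, sentences, perl = TRUE)

print(check_substring)   # TRUE TRUE FALSE
print(check_whole_word)  # FALSE TRUE FALSE
```

The same `bounded` pattern could be passed to grepl inside the mutate() call above.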

CodePudding user response：

Here is a base R solution.

inx <- sapply(names$names, \(pat) grepl(pat, df$sentences))
inx
#>        Bob  Mary Michael  John  Etc.
#> [1,]  TRUE FALSE   FALSE FALSE FALSE
#> [2,] FALSE FALSE   FALSE FALSE FALSE
#> [3,] FALSE FALSE    TRUE FALSE FALSE

inx <- rowSums(inx) > 0L
df$check <- inx
df
#> # A tibble: 3 × 2
#>   sentences                    check
#>   <chr>                        <lgl>
#> 1 Bob is looking for something TRUE 
#> 2 Adriana has an umbrella      FALSE
#> 3 Michael is looking at...     TRUE

Created on 2023-01-11 with reprex v2.0.2
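
The question used fixed = TRUE, while the code above treats each name as a regular expression. That matters for a name like "Etc.", where the dot matches any character. Below is a small sketch of the difference, assuming literal matching is what is wanted; the `sentences` and `name_list` vectors are illustrative and not taken from the question.

```r
# Hypothetical inputs: "Etcetera" has no literal "Etc." in it.
sentences <- c("Bob is looking for something",
               "Adriana has an umbrella",
               "Etcetera")
name_list <- c("Bob", "Etc.")

# Regex matching: the "." in "Etc." matches any character, so "Etce" matches.
hits_regex <- sapply(name_list, function(nm) grepl(nm, sentences))
check_regex <- rowSums(hits_regex) > 0

# Literal matching: fixed = TRUE treats "Etc." as plain text.
hits_fixed <- sapply(name_list, function(nm) grepl(nm, sentences, fixed = TRUE))
check_fixed <- rowSums(hits_fixed) > 0

print(check_regex)  # TRUE FALSE TRUE
print(check_fixed)  # TRUE FALSE FALSE
```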

CodePudding user response：

grep and family expect pattern= to be a single string. Similarly, str_detect needs a character vector plus a pattern (of length 1 or the same length as the strings), not a logical vector, so neither attempt works as-is.

We have a couple of options:

sapply on the names (into a matrix) and see if each row has one or more matches:

df %>%
  mutate(check = rowSums(sapply(names$names, grepl, sentences)) > 0)
# # A tibble: 3 × 2
#   sentences                    check
#   <chr>                        <lgl>
# 1 Bob is looking for something TRUE 
# 2 Adriana has an umbrella      FALSE
# 3 Michael is looking at...     TRUE

(I now see this is in RuiBarradas's answer.)

Do a fuzzy-join on the data using fuzzyjoin:

df %>%
  fuzzyjoin::regex_left_join(names, by = c(sentences = "names")) %>%
  mutate(check = !is.na(names))
# # A tibble: 3 × 3
#   sentences                    names   check
#   <chr>                        <chr>   <lgl>
# 1 Bob is looking for something Bob     TRUE 
# 2 Adriana has an umbrella      NA      FALSE
# 3 Michael is looking at...     Michael TRUE

This method has the advantage that it tells you which pattern (in names) made the match.
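
Keep in mind that with a join, a sentence matching several names comes back as several rows, one per match. If one row per sentence is preferred, the matching names can be collapsed into a single string instead. This is a rough base R sketch of that idea rather than anything fuzzyjoin provides; `sentences`, `name_list`, and the `matched` column are names made up for illustration.

```r
# Hypothetical inputs: the first sentence mentions two names from the list.
sentences <- c("Bob and Michael are talking",
               "Adriana has an umbrella")
name_list <- c("Bob", "Mary", "Michael", "John")

# Logical matrix: one row per sentence, one column per name.
hits <- sapply(name_list, function(nm) grepl(nm, sentences, fixed = TRUE))

# For each sentence, paste together the names that matched ("" if none).
matched <- apply(hits, 1, function(row) paste(name_list[row], collapse = ", "))

result <- data.frame(sentences = sentences,
                     matched = matched,
                     check = rowSums(hits) > 0)
print(result)
```

Here `matched` is "Bob, Michael" for the first sentence and an empty string for the second.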

CodePudding user response：

Maybe we can try adist with colSums, like below:

df %>%
  mutate(check = colSums(adist(names$names, sentences, fixed = FALSE) == 0) > 0)

which gives

# A tibble: 3 × 2
  sentences                    check
  <chr>                        <lgl>
1 Bob is looking for something TRUE
2 Adriana has an umbrella      FALSE
3 Michael is looking at...     TRUE