I've got a tibble containing sentences like this:
df <- tibble(sentences = c("Bob is looking for something", "Adriana has an umbrella", "Michael is looking at..."))
And another containing a long list of names:
names <- tibble(names = c("Bob", "Mary", "Michael", "John", "Etc."))
I would like to check whether each sentence contains a name from the list, and add a column indicating the result, to get the following tibble:
wanted_df <- tibble(sentences = c("Bob is looking for something", "Adriana has an umbrella", "Michael is looking at..."), check = c(TRUE, FALSE, TRUE))
So far I've tried this, with no success:
df <- df %>%
mutate(check = grepl(pattern = names$names, x = df$sentences, fixed = TRUE))
And also:
check <- str_detect(names$names %in% df$sentences)
Thanks a lot for any help ;)
CodePudding user response:
You should form a single regex for grepl by collapsing the names with "|":
df %>%
  mutate(check = grepl(paste(names$names, collapse = "|"), sentences))
# A tibble: 3 × 2
sentences check
<chr> <lgl>
1 Bob is looking for something TRUE
2 Adriana has an umbrella FALSE
3 Michael is looking at... TRUE
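One caveat with a plain "|" pattern: it matches substrings (so "Bob" also hits "Bobby"), and a name like "Etc." is read as a regex where the dot matches any character. Below is a minimal sketch of one way around that, assuming whole-name, literal matching is what you want. The extra "Bobby left early" row and the `pattern` and `res` variables are hypothetical and only there for illustration; `\Q...\E` and the lookarounds need `perl = TRUE`.

```r
library(dplyr)
library(tibble)

df <- tibble(sentences = c("Bob is looking for something",
                           "Adriana has an umbrella",
                           "Michael is looking at...",
                           "Bobby left early"))  # hypothetical extra row
names <- tibble(names = c("Bob", "Mary", "Michael", "John", "Etc."))

# \Q...\E (PCRE) quotes each name, so "." is a literal dot.
# The lookarounds require that no word character touches the match,
# which keeps "Bob" from matching inside "Bobby".
pattern <- paste0("(?<!\\w)(?:",
                  paste0("\\Q", names$names, "\\E", collapse = "|"),
                  ")(?!\\w)")

res <- df %>%
  mutate(check = grepl(pattern, sentences, perl = TRUE))

res$check
```

Lookarounds are used here rather than `\b` because `\b` after a trailing "." would only match when a word character follows, which is probably not what you want for "Etc.".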
CodePudding user response:
Here is a base R solution.
inx <- sapply(names$names, \(pat) grepl(pat, df$sentences))
inx
#> Bob Mary Michael John Etc.
#> [1,] TRUE FALSE FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE FALSE FALSE
#> [3,] FALSE FALSE TRUE FALSE FALSE
inx <- rowSums(inx) > 0L
df$check <- inx
df
#> # A tibble: 3 × 2
#> sentences check
#> <chr> <lgl>
#> 1 Bob is looking for something TRUE
#> 2 Adriana has an umbrella FALSE
#> 3 Michael is looking at... TRUE
Created on 2023-01-11 with reprex v2.0.2
CodePudding user response:
grep and family expect pattern= to be length 1. Similarly, str_detect needs strings, not a logical vector, and of the same length, so that won't work as-is.
We have a couple of options:
sapply on the names (into a matrix) and see if each row has one or more matches:
df %>%
  mutate(check = rowSums(sapply(names$names, grepl, sentences)) > 0)
# # A tibble: 3 × 2
#   sentences                    check
#   <chr>                        <lgl>
# 1 Bob is looking for something TRUE
# 2 Adriana has an umbrella      FALSE
# 3 Michael is looking at...     TRUE
(I now see this is in RuiBarradas's answer.)
Do a fuzzy-join on the data using fuzzyjoin:
df %>%
  fuzzyjoin::regex_left_join(names, by = c(sentences = "names")) %>%
  mutate(check = !is.na(names))
# # A tibble: 3 × 3
#   sentences                    names   check
#   <chr>                        <chr>   <lgl>
# 1 Bob is looking for something Bob     TRUE
# 2 Adriana has an umbrella      NA      FALSE
# 3 Michael is looking at...     Michael TRUE
This method has the advantage that it tells you which pattern (in names) made the match.
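If you would rather stay with stringr (as in the question's second attempt), here is a rough sketch of the same idea: str_detect takes the strings first and a single collapsed pattern second, and str_extract can report which name matched. The `pattern`, `res` and `matched` names are just illustrative, and note that str_extract only returns the first match in each sentence.

```r
library(dplyr)
library(stringr)
library(tibble)

df <- tibble(sentences = c("Bob is looking for something",
                           "Adriana has an umbrella",
                           "Michael is looking at..."))
names <- tibble(names = c("Bob", "Mary", "Michael", "John", "Etc."))

# One regex with all names as alternatives: "Bob|Mary|Michael|John|Etc."
pattern <- str_c(names$names, collapse = "|")

res <- df %>%
  mutate(check   = str_detect(sentences, pattern),   # TRUE if any name matches
         matched = str_extract(sentences, pattern))  # first matching name, or NA

res
```

Unlike the fuzzy join, this keeps exactly one row per sentence even when several names match.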
CodePudding user response:
Maybe we can try adist with colSums, like below
df %>%
mutate(check = colSums(adist(names$names, sentences, fixed = FALSE) == 0) > 0)
which gives
# A tibble: 3 × 2
sentences check
<chr> <lgl>
1 Bob is looking for something TRUE
2 Adriana has an umbrella FALSE
3 Michael is looking at... TRUE