How to match pattern against multiple strings and store in data.frame in R

Time:09-06

dat1 <- data.frame(id1 = c(1, 1, 2),
          pattern = c("apple", "applejack", "bananas, sweet"))
dat2 <- data.frame(id2 = c(1174, 1231),
          description = c("apple is sweet", "bananass are not"),
          description2 = c("melon", "bananas, sweet yes"))

> dat1
  id1        pattern
1   1          apple
2   1      applejack
3   2 bananas, sweet
> dat2
   id2      description       description2
1 1174   apple is sweet              melon
2 1231 bananass are not bananas, sweet yes

I have two data frames, dat1 and dat2. I would like to take each pattern in dat1 and search for it in dat2's description and description2 columns using the regular expression \\b[pattern]\\b.
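
To make the intent concrete, here is a minimal sketch of what the \\b anchors do on strings like the ones above. The names x, plain and bounded are introduced here for illustration only.

```r
# Illustrative strings, not the full data set
x <- c("apple is sweet", "applejack", "bananass are not")

# Without boundaries, "apple" also matches inside "applejack"
plain <- grepl("apple", x)
print(plain)

# With \\b on both sides, only the whole word "apple" matches
bounded <- grepl("\\bapple\\b", x)
print(bounded)
```

Note that the pattern is still treated as a regular expression, so a pattern containing characters such as "." or "(" would need escaping first.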

Here is my attempt and the desired final output:

description_match <- description2_match <- vector()
for(i in 1:nrow(dat1)){
  for(j in 1:nrow(dat2)){
    search_pattern <- paste0("\\b", dat1$pattern[i], "\\b")
    description_match <- c(description_match, ifelse(grepl(search_pattern, dat2[j, "description"]), 1, 0))
    description2_match <- c(description2_match, ifelse(grepl(search_pattern, dat2[j, "description2"]), 1, 0))
  }
}
final_output <- data.frame(id1 = rep(dat1$id1, each = nrow(dat2)),
                           pattern = rep(dat1$pattern, each = nrow(dat2)),
                           id2 = rep(dat2$id2, length = nrow(dat1) * nrow(dat2)),
                           description_match = description_match,
                           description2_match = description2_match)

> final_output
  id1        pattern  id2 description_match description2_match
1   1          apple 1174                 1                  0
2   1          apple 1231                 0                  0
3   1      applejack 1174                 0                  0
4   1      applejack 1231                 0                  0
5   2 bananas, sweet 1174                 0                  0
6   2 bananas, sweet 1231                 0                  1

This approach is slow and inefficient when dat1 and dat2 have many rows. Is there a quicker way to do this that avoids the for loop?

CodePudding user response:

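Using outer with a vectorized grepl. The patterns get the \\b anchors from the question, and each description column is passed as the first argument so that descriptions vary fastest, which matches the row order produced by rep(..., each = nrow(dat2)).

pats <- paste0("\\b", dat1$pattern, "\\b")  # word boundaries, as the question asks
# x goes first so descriptions vary fastest; unary + turns TRUE/FALSE into 1/0
r <- sapply(dat2[-1], \(x) +outer(x, pats, Vectorize(\(s, p) grepl(p, s))))
cbind(dat1[rep(seq_len(nrow(dat1)), each = nrow(dat2)), ], id2 = dat2$id2, r)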
#     id1        pattern  id2 description description2
# 1     1          apple 1174           1            0
# 1.1   1          apple 1231           0            0
# 2     1      applejack 1174           0            0
# 2.1   1      applejack 1231           0            0
# 3     2 bananas, sweet 1174           0            0
# 3.1   2 bananas, sweet 1231           0            1
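
One thing worth checking with outer is the order in which its result is flattened, since that decides which row each match lands in. The sketch below uses made-up vectors p and d and pastes labels together instead of calling grepl, so the pairing is visible. outer(X, Y, FUN) puts X along the rows, and R flattens matrices column by column, so the first argument varies fastest.

```r
p <- c("p1", "p2", "p3")   # stand-ins for the patterns
d <- c("d1", "d2")         # stand-ins for the descriptions

# Patterns first: patterns vary fastest in the flattened result
pat_first <- as.vector(outer(p, d, paste))
print(pat_first)
# "p1 d1" "p2 d1" "p3 d1" "p1 d2" "p2 d2" "p3 d2"

# Descriptions first: descriptions vary fastest
desc_first <- as.vector(outer(d, p, paste))
print(desc_first)
# "d1 p1" "d2 p1" "d1 p2" "d2 p2" "d1 p3" "d2 p3"
```

When the id columns are built with rep(..., each = nrow(dat2)), each pattern is repeated once per description, so the second ordering is the one that lines up with the rows.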

CodePudding user response:

A tidyverse solution with:

  • tidyr::crossing producing all combinations of dat1 and dat2
  • stringr::str_detect pairwise detecting the presence of a pattern in a string.

library(tidyverse)

crossing(dat1, dat2) %>%
  mutate(across(contains('description'), ~ +str_detect(.x, sprintf('\\b%s\\b', pattern))))

# A tibble: 6 × 5
    id1 pattern          id2 description description2
  <dbl> <chr>          <dbl>       <int>        <int>
1     1 apple           1174           1            0
2     1 apple           1231           0            0
3     1 applejack       1174           0            0
4     1 applejack       1231           0            0
5     2 bananas, sweet  1174           0            0
6     2 bananas, sweet  1231           0            1

CodePudding user response:

Another option, though it may be slower than @jay.sf's:

Your data frames:

dat1 <- data.frame(id1 = c(1, 1, 2),
                   pattern = c("apple", "applejack", "bananas, sweet"))
dat2 <- data.frame(id2 = c(1174, 1231),
                   description = c("apple is sweet", "bananass are not"),
                   description2 = c("melon", "bananas, sweet yes"))

Add a column with the pattern you'd like to use for matching:

dat1$pattern_grep = paste0("\\b", dat1$pattern, "\\b")

Perform a Cartesian join, i.e. join every row of dat2 to each row of dat1:

cj = merge(dat1, dat2, all = TRUE, by = c())

Perform your grepl now:

cj$description_match <- mapply(grepl, cj$pattern_grep, cj$description)*1
cj$description2_match <- mapply(grepl, cj$pattern_grep, cj$description2)*1

  • Think of mapply as running grepl once per row of the data frame, pairing each pattern with the description on the same row
  • The result is multiplied by 1 to convert the logical values to 1/0
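
As a small illustration of that row-wise behaviour, the sketch below calls mapply on two made-up vectors of equal length; pats, strs and hits are hypothetical names. USE.NAMES = FALSE is optional here and only drops the names that mapply would otherwise copy from the first argument.

```r
pats <- c("\\bapple\\b", "\\bapple\\b", "\\bmelon\\b")
strs <- c("apple is sweet", "applejack", "melon")

# One call per pair: grepl(pats[1], strs[1]), grepl(pats[2], strs[2]), ...
hits <- mapply(grepl, pats, strs, USE.NAMES = FALSE)
print(hits * 1)   # 1 0 1
```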

Keep relevant columns:

cj = cj[, c("id1", "pattern", "id2", "description_match", "description2_match")]

  id1        pattern  id2 description_match description2_match
1   1          apple 1174                 1                  0
2   1      applejack 1174                 0                  0
3   2 bananas, sweet 1174                 0                  0
4   1          apple 1231                 0                  0
5   1      applejack 1231                 0                  0
6   2 bananas, sweet 1231                 0                  1
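
When replacing the loop with a vectorized version, it can help to confirm that both give the same answers in the same order. Below is a minimal sketch of such a check in base R on the sample data. The names loop_res, pairs and vec_res are introduced here for illustration, and only the description column is compared.

```r
dat1 <- data.frame(id1 = c(1, 1, 2),
                   pattern = c("apple", "applejack", "bananas, sweet"))
dat2 <- data.frame(id2 = c(1174, 1231),
                   description = c("apple is sweet", "bananass are not"),
                   description2 = c("melon", "bananas, sweet yes"))

# Loop version: patterns in the outer loop, descriptions in the inner loop
loop_res <- integer(0)
for (i in seq_len(nrow(dat1))) {
  for (j in seq_len(nrow(dat2))) {
    rx <- paste0("\\b", dat1$pattern[i], "\\b")
    loop_res <- c(loop_res, as.integer(grepl(rx, dat2$description[j])))
  }
}

# Vectorized version: expand.grid varies its first column fastest,
# so j (descriptions) cycles inside each i (pattern), like the loop
pairs <- expand.grid(j = seq_len(nrow(dat2)), i = seq_len(nrow(dat1)))
vec_res <- as.integer(mapply(grepl,
                             paste0("\\b", dat1$pattern[pairs$i], "\\b"),
                             dat2$description[pairs$j],
                             USE.NAMES = FALSE))

print(identical(loop_res, vec_res))
```

The same comparison can be repeated for description2, or on a larger random sample of rows, before relying on the faster version.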