dat1 <- data.frame(id1 = c(1, 1, 2),
                   pattern = c("apple", "applejack", "bananas, sweet"))
dat2 <- data.frame(id2 = c(1174, 1231),
                   description = c("apple is sweet", "bananass are not"),
                   description2 = c("melon", "bananas, sweet yes"))
> dat1
  id1        pattern
1   1          apple
2   1      applejack
3   2 bananas, sweet
> dat2
   id2      description       description2
1 1174   apple is sweet              melon
2 1231 bananass are not bananas, sweet yes
I have two data.frames, dat1 and dat2. I would like to take each pattern in dat1 and search for it in dat2's description and description2 columns using the regular expression \\b[pattern]\\b.
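To make the effect of the \\b anchors concrete, here is a minimal sketch using strings from the sample data plus one made-up string ("pineapple pie" is an assumption added purely for illustration):

```r
# \\b matches at a word boundary: the position between a word character
# (letter, digit, underscore) and a non-word character or the string edge.
strings <- c("apple is sweet", "applejack", "pineapple pie")  # last entry is illustrative only

plain   <- grepl("apple", strings)        # substring match: hits all three
bounded <- grepl("\\bapple\\b", strings)  # whole-word match: hits only the first

# A pattern containing punctuation and spaces still works, because the
# anchors only constrain the first and last characters of the pattern.
multi <- grepl("\\bbananas, sweet\\b",
               c("bananass are not", "bananas, sweet yes"))

print(plain)
print(bounded)
print(multi)
```

Note that the patterns are pasted into the regular expression as-is, so a pattern containing regex metacharacters (such as "." or "(") would be interpreted as regex syntax rather than literal text.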
Here is my attempt and the desired final output:
description_match <- description2_match <- vector()
for(i in 1:nrow(dat1)){
  for(j in 1:nrow(dat2)){
    search_pattern <- paste0("\\b", dat1$pattern[i], "\\b")
    description_match <- c(description_match, ifelse(grepl(search_pattern, dat2[j, "description"]), 1, 0))
    description2_match <- c(description2_match, ifelse(grepl(search_pattern, dat2[j, "description2"]), 1, 0))
  }
}
final_output <- data.frame(id1 = rep(dat1$id1, each = nrow(dat2)),
                           pattern = rep(dat1$pattern, each = nrow(dat2)),
                           id2 = rep(dat2$id2, length = nrow(dat1) * nrow(dat2)),
                           description_match = description_match,
                           description2_match = description2_match)
> final_output
  id1        pattern  id2 description_match description2_match
1   1          apple 1174                 1                  0
2   1          apple 1231                 0                  0
3   1      applejack 1174                 0                  0
4   1      applejack 1231                 0                  0
5   2 bananas, sweet 1174                 0                  0
6   2 bananas, sweet 1231                 0                  1
This approach is slow and inefficient if dat1 and dat2 have many rows. What's a quicker way to do this so that I can avoid a for loop?
CodePudding user response:
Using outer and a Vectorized grepl.
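pat <- paste0("\\b", dat1$pattern, "\\b")  # word boundaries, as in the question
# t() makes the dat2 row vary fastest, matching rep(..., each = ) below; + gives 1/0
r <- sapply(dat2[-1], \(x) +t(outer(pat, x, Vectorize(grepl))))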
cbind(dat1[rep(seq_len(nrow(dat1)), each=nrow(dat2)), ], id2=dat2$id2, r)
#     id1        pattern  id2 description description2
# 1     1          apple 1174           1            0
# 1.1   1          apple 1231           0            0
# 2     1      applejack 1174           0            0
# 2.1   1      applejack 1231           0            0
# 3     2 bananas, sweet 1174           0            0
# 3.1   2 bananas, sweet 1231           0            1
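One thing to watch when flattening an outer() result into a column is the element order. The sketch below is a standalone illustration on the sample patterns and the description column; the names pat, desc, m and flat are chosen here for the example only.

```r
pat  <- paste0("\\b", c("apple", "applejack", "bananas, sweet"), "\\b")
desc <- c("apple is sweet", "bananass are not")

# grepl() takes a single pattern per call, so Vectorize() wraps it in a
# function that walks over pattern/string pairs, which is what outer() expects.
m <- outer(pat, desc, Vectorize(grepl))
print(dim(m))  # one row per pattern, one column per description

# as.vector() reads a matrix column by column, so on m itself the pattern
# index would change fastest. Transposing first makes the description index
# change fastest, which lines up with rep(patterns, each = length(desc)).
flat <- as.vector(t(m))
print(flat)
```

If the flattened order and the row labels disagree, the result can still look right on small inputs like this one, so it is worth checking against a case where more than one cell matches.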
CodePudding user response:
A tidyverse solution with:
- tidyr::crossing, producing all combinations of dat1 and dat2
- stringr::str_detect, pairwise detecting the presence of a pattern in a string
library(tidyverse)
crossing(dat1, dat2) %>%
  # unary + converts the logical result of str_detect() to 0/1
  mutate(across(contains('description'), ~ +str_detect(.x, sprintf('\\b%s\\b', pattern))))
# A tibble: 6 × 5
    id1 pattern          id2 description description2
  <dbl> <chr>          <dbl>       <int>        <int>
1     1 apple           1174           1            0
2     1 apple           1231           0            0
3     1 applejack       1174           0            0
4     1 applejack       1231           0            0
5     2 bananas, sweet  1174           0            0
6     2 bananas, sweet  1231           0            1
CodePudding user response:
Another option, though it may be slower than @jay.sf's.
Your data frames:
dat1 <- data.frame(id1 = c(1, 1, 2),
                   pattern = c("apple", "applejack", "bananas, sweet"))
dat2 <- data.frame(id2 = c(1174, 1231),
                   description = c("apple is sweet", "bananass are not"),
                   description2 = c("melon", "bananas, sweet yes"))
Add a column with the pattern you'd like to use for matching:
dat1$pattern_grep = paste0("\\b", dat1$pattern, "\\b")
Perform a Cartesian join (i.e. join every row of dat2 to each row of dat1):
cj = merge(dat1, dat2, all = TRUE, by = NULL)
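As a quick sanity check on what the Cartesian join produces, here is a small sketch on cut-down copies of the data (a, b and cj_demo are names made up for this illustration). With no key columns, merge() returns every combination of rows, with the rows of the first data frame cycling fastest.

```r
a <- data.frame(id1 = c(1, 1, 2),
                pattern = c("apple", "applejack", "bananas, sweet"))
b <- data.frame(id2 = c(1174, 1231))

# by = NULL means there are no key columns, so merge() falls back to the
# Cartesian product of the two data frames.
cj_demo <- merge(a, b, by = NULL)

print(nrow(cj_demo))  # nrow(a) * nrow(b)
print(cj_demo)
```

This row order (all patterns for the first id2, then all patterns for the second) differs from the nested-loop order in the question, which is why the final table of this answer is sorted differently.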
Perform your grepl now:
cj$description_match <- mapply(grepl, cj$pattern_grep, cj$description)*1
cj$description2_match <- mapply(grepl, cj$pattern_grep, cj$description2)*1
- Think of the mapply as performing the grepl on each row of your data frame
- Multiplying by 1 converts the boolean to 1/0
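The pairwise behaviour of mapply can be seen on a few hand-picked pairs. This is only a sketch; the vectors patterns and strings are made up for the illustration, and USE.NAMES = FALSE is added here just to keep the printed result unnamed.

```r
patterns <- c("\\bapple\\b", "\\bapple\\b", "\\bbananas, sweet\\b")
strings  <- c("apple is sweet", "applejack", "bananas, sweet yes")

# mapply() calls grepl(patterns[1], strings[1]), grepl(patterns[2], strings[2]),
# and so on: one pattern against one string per call.
hits <- mapply(grepl, patterns, strings, USE.NAMES = FALSE)
print(hits)

# Multiplying a logical vector by 1 coerces TRUE/FALSE to 1/0.
print(hits * 1)
```

Calling grepl(patterns, strings) directly would not do this, since grepl only uses the first element of a pattern vector (and warns about it).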
Keep relevant columns:
cj = cj[, c("id1", "pattern", "id2", "description_match", "description2_match")]
  id1        pattern  id2 description_match description2_match
1   1          apple 1174                 1                  0
2   1      applejack 1174                 0                  0
3   2 bananas, sweet 1174                 0                  0
4   1          apple 1231                 0                  0
5   1      applejack 1231                 0                  0
6   2 bananas, sweet 1231                 0                  1