This is a follow up on r - for loop to compare 2 dataframes all by all rows but with application of different matching function (stringr::str_detect()
).
I'm trying to compare 2 columns from dataframe d1
with 2 columns from dataframe d2
, row by row. To illustrate the issue I created dummy datasets:
d1 <- data.frame(
a = c("a","b","c", "d"),
b = c("aa", "bbb", "ccc", "d")
)
d2 <- data.frame(
a = c("a", "a", "d", "b"),
b = c("aaa", "bbb", "ddd", "bbb")
)
Ideally, I would like to flag all rows in d1
for which I can find a match in at least one row of d2
. This is what provides the desired result:
output <- matrix(nrow = nrow(d1), ncol =nrow(d2))
for (i in 1:nrow(d1)) {
for (j in 1:nrow(d2)) {
output[i,j] = ifelse(
d1$a[i] == d2$a[j] & str_detect(d2$b[j], d1$b[i]),
1,
0)
}
}
d1$flag <- apply(output,1,max)
d1
Since the number of rows of my d1 table can become rather large consequently the size of the matrix will become huge as well. Is there a better way of writing this loop, so that instead of the matrix i create a vector with the max values per row and the memory doesn't explode ?
CodePudding user response:
Efficient join
@Roland hinted at one good solution: fuzzyjoin::fuzzy_*_join
. The only issue with it is that it is still a join, which the OP wants to avoid ... frankly, most of the merge
and *_join
functions out there are relatively efficient in that comparisons are done before exploding every-row against every-row; inefficiency comes when the constraints are too loose and it nears a cartesian join.
Going with that, I'll modify d2
so that we have at least one two-row match:
d2 <- data.frame(
a = c("a", "a", "d", "b", "b"),
b = c("aaa", "bbb", "ddd", "bbb", "bbb")
)
From here, we need to add in a row id, join, then aggregate on that row id.
d1$rowid <- seq_len(nrow(d1))
out <- fuzzyjoin::fuzzy_left_join(d1, d2, by = c("a", "b"), match_fun = list(`==`, Vectorize(grepl)))
out
# a.x b.x rowid a.y b.y
# 1 a aa 1 a aaa
# 2 b bbb 2 b bbb
# 3 b bbb 2 b bbb
# 4 c ccc 3 <NA> <NA>
# 5 d d 4 d ddd
With that, notice that rowid 2 has two occurrences (so we need to combine them), and rowid 3 is unmatched. (Side note: fuzzyjoin
by design duplicates all "by" columns. It's easy enough to reduce them, even if it seems odd.)
out2 <- aggregate(a.y ~ rowid a.x b.x, data = out, na.action = na.pass,
FUN = function(z) sum(!is.na(z)))
names(out2)[-1] <- c("a", "b", "flag")
out2
# rowid a b flag
# 1 1 a aa 1
# 2 2 b bbb 2
# 3 3 c ccc 0
# 4 4 d d 1
By-Row
If this causes memory problems, then the next step would be to do this by-row for one of the frames.
d1$flag <- mapply(function(A1, B1) sum(A1 == d2$a & grepl(B1, d2$b)), d1$a, d1$b)
d1
# a b flag
# 1 a aa 1
# 2 b bbb 2
# 3 c ccc 0
# 4 d d 1
CodePudding user response:
Wanted to create a solution using str_detect
and for loops
but truthfully it's a lot less efficient than solutions provided both here and in your original post:
d1$flag <- 0
for (i in 1:nrow(d1)) {
flag <- list()
for (j in 1:nrow(d2)) {
flag <- rbind(flag, stringr::str_detect(paste(d1[i,1], d1[i,2]), paste(d2[j,1], d2[j,2])))
}
if (any(flag == TRUE)) {
d1[i, 3] <- 1
}
}
If you don't want to work with column indices you can replace them with "column name", just make sure to put quotes around it, i.e. d1[i, "flag"]
.