For loop to compare 2 dataframes all by all rows with custom matching functions


This is a follow-up to r - for loop to compare 2 dataframes all by all rows, but applying a different matching function (stringr::str_detect()).

I'm trying to compare 2 columns from dataframe d1 with 2 columns from dataframe d2, row by row. To illustrate the issue, I created dummy datasets:

d1 <- data.frame(
  a = c("a","b","c", "d"),
  b = c("aa", "bbb", "ccc", "d")
)

d2 <- data.frame(
  a = c("a", "a", "d", "b"),
  b = c("aaa", "bbb", "ddd", "bbb")
)

Ideally, I would like to flag all rows in d1 for which I can find a match in at least one row of d2. This is what provides the desired result:

library(stringr)

output <- matrix(nrow = nrow(d1), ncol = nrow(d2))
for (i in 1:nrow(d1)) {
  for (j in 1:nrow(d2)) {
    output[i,j] = ifelse(
      d1$a[i] == d2$a[j] & str_detect(d2$b[j], d1$b[i]), 
      1, 
      0)
  }
}

d1$flag <- apply(output,1,max)
d1

Since the number of rows in my d1 table can become rather large, the matrix will become huge as well. Is there a better way of writing this loop, so that instead of the full matrix I build a vector of the per-row maxima and memory doesn't explode?
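For reference, one memory-light variant of the loop above keeps only a running flag per row and breaks out of the inner loop at the first match. This is a sketch using base `grepl()` in place of `str_detect()` (equivalent for these patterns), so it has no package dependency:

```r
d1 <- data.frame(a = c("a", "b", "c", "d"),
                 b = c("aa", "bbb", "ccc", "d"))
d2 <- data.frame(a = c("a", "a", "d", "b"),
                 b = c("aaa", "bbb", "ddd", "bbb"))

# one integer per d1 row instead of an nrow(d1) x nrow(d2) matrix
flag <- integer(nrow(d1))
for (i in 1:nrow(d1)) {
  for (j in 1:nrow(d2)) {
    if (d1$a[i] == d2$a[j] && grepl(d1$b[i], d2$b[j])) {
      flag[i] <- 1L
      break  # first match is enough; skip the rest of d2
    }
  }
}
d1$flag <- flag
```

This gives the same flags as `apply(output, 1, max)` without ever materializing the matrix.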

CodePudding user response:

Efficient join

@Roland hinted at one good solution: fuzzyjoin::fuzzy_*_join. The only issue is that it is still a join, which the OP wants to avoid. Frankly, most merge and *_join implementations are relatively efficient, in that comparisons are done before expanding every row against every row; inefficiency comes when the constraints are too loose and the result nears a cartesian join.

Going with that, I'll modify d2 so that we have at least one two-row match:

d2 <- data.frame(
  a = c("a", "a", "d", "b", "b"),
  b = c("aaa", "bbb", "ddd", "bbb", "bbb")
)

From here, we need to add in a row id, join, then aggregate on that row id.

d1$rowid <- seq_len(nrow(d1))
out <- fuzzyjoin::fuzzy_left_join(d1, d2, by = c("a", "b"), match_fun = list(`==`, Vectorize(grepl)))
out
#   a.x b.x rowid  a.y  b.y
# 1   a  aa     1    a  aaa
# 2   b bbb     2    b  bbb
# 3   b bbb     2    b  bbb
# 4   c ccc     3 <NA> <NA>
# 5   d   d     4    d  ddd

With that, notice that rowid 2 has two occurrences (so we need to combine them), and rowid 3 is unmatched. (Side note: fuzzyjoin by design duplicates all "by" columns. It's easy enough to reduce them, even if it seems odd.)
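Reducing those duplicated columns is just column subsetting. A minimal sketch, with `out` rebuilt by hand from the printed result above so it runs standalone (the `.x`/`.y` suffixes are fuzzyjoin's defaults):

```r
# `out` as printed above, rebuilt by hand for a standalone illustration
out <- data.frame(
  a.x   = c("a", "b", "b", "c", "d"),
  b.x   = c("aa", "bbb", "bbb", "ccc", "d"),
  rowid = c(1L, 2L, 2L, 3L, 4L),
  a.y   = c("a", "b", "b", NA, "d"),
  b.y   = c("aaa", "bbb", "bbb", NA, "ddd")
)

# keep d1's copies of the "by" columns plus one right-hand column
# (a.y, NA where unmatched) to count matches on; restore short names
slim <- out[, c("rowid", "a.x", "b.x", "a.y")]
names(slim) <- c("rowid", "a", "b", "match")
```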

out2 <- aggregate(a.y ~ rowid + a.x + b.x, data = out, na.action = na.pass,
                  FUN = function(z) sum(!is.na(z)))
names(out2)[-1] <- c("a", "b", "flag")
out2
#   rowid a   b flag
# 1     1 a  aa    1
# 2     2 b bbb    2
# 3     3 c ccc    0
# 4     4 d   d    1

By-Row

If this causes memory problems, then the next step would be to do this by-row for one of the frames.

d1$flag <- mapply(function(A1, B1) sum(A1 == d2$a & grepl(B1, d2$b)), d1$a, d1$b)
d1
#   a   b flag
# 1 a  aa    1
# 2 b bbb    2
# 3 c ccc    0
# 4 d   d    1
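Note that this `flag` is a match count (row 2 gets 2). If a plain 0/1 indicator is wanted, as in the original matrix approach, wrapping the per-row comparison in `any()` collapses it to a single TRUE/FALSE before conversion; a sketch with the same base-R matching:

```r
d1 <- data.frame(a = c("a", "b", "c", "d"),
                 b = c("aa", "bbb", "ccc", "d"))
d2 <- data.frame(a = c("a", "a", "d", "b", "b"),
                 b = c("aaa", "bbb", "ddd", "bbb", "bbb"))

# any() collapses each row's matches to one TRUE/FALSE,
# so the flag is 0/1 rather than a match count
d1$flag <- mapply(function(A1, B1) as.integer(any(A1 == d2$a & grepl(B1, d2$b))),
                  d1$a, d1$b)
d1
#   a   b flag
# 1 a  aa    1
# 2 b bbb    1
# 3 c ccc    0
# 4 d   d    1
```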

CodePudding user response:

I wanted to create a solution using str_detect and for loops, but truthfully it's a lot less efficient than the solutions provided both here and in your original post:

d1$flag <- 0

for (i in 1:nrow(d1)) {
  flag <- logical(nrow(d2))
  for (j in 1:nrow(d2)) {
    flag[j] <- stringr::str_detect(paste(d1[i, 1], d1[i, 2]),
                                   paste(d2[j, 1], d2[j, 2]))
  }
  if (any(flag)) {
    d1[i, 3] <- 1
  }
}

If you don't want to work with column indices, you can replace them with column names; just make sure to quote them, e.g. d1[i, "flag"].
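The inner j loop can also be vectorized away: `str_detect()` is vectorized over its pattern argument, so one call per d1 row covers all of d2. A sketch with the same paste-based matching as the loop above (note this matching differs from the question's, so the flags differ too):

```r
library(stringr)

d1 <- data.frame(a = c("a", "b", "c", "d"),
                 b = c("aa", "bbb", "ccc", "d"))
d2 <- data.frame(a = c("a", "a", "d", "b"),
                 b = c("aaa", "bbb", "ddd", "bbb"))

d1$flag <- 0
for (i in 1:nrow(d1)) {
  # one vectorized call: each pasted d2 row is tried as a pattern
  hits <- str_detect(paste(d1[i, 1], d1[i, 2]), paste(d2$a, d2$b))
  if (any(hits)) d1[i, 3] <- 1
}
```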
