How to check if values in first dataframe are contained or match values in another dataframe-CodePudding

The bounty expires in 6 days. Answers to this question are eligible for a 300 reputation bounty. Duck is looking for an answer from a reputable source:

Hi. This is one of my students needing help. Despite having a solution, we believe it is not enough to solve the issue. Student has added new data as names can have more than three words. The main issue is that second dataframe can be larger so a fast solution is required. Preferred dplyr solution or any quick approach with no loops.

I am using R to work with some dataframes. My issue is related on how to check if values in a variable in a first dataframe match with values in another dataframe. The match is very different to those like merge or join. I will introduce my dataframes (dput() at end):

My first dataframe is df1. It contains the variable name which I want to contrast with other variable in a second dataframe. It looks like this:

df1
                   name
1            JUAN GIRON
2            GINA OLEAS
3 JUAN FERNANDO ELIZAGA
4          MARCO TORRES
5   JUAN PABLO GONZALEZ
6            IRMA GOMEZ

The second dataframe is df2. It also contains a variable name which will be used to the contrast with name from df1. It looks like this (In a real situation df2 can be very large with more than 1000 rows):

df2
                      name val
1            JUANA MARQUEZ   1
2         FERNANDO ELIZAGA   2
3               IRMA GOMEZ   3
4           PABLO GONZALEZ   4
5               GINA LUCIO   5
6              MARK TORRES   6
7           LETICIA BLACIO   7
8 JUAN PABLO GIRON BELTRAN   8

I am looking for a way to check if every row of df1 for name variable is contained or match with any value for name in df2. For example, the value JUAN GIRON after checking with name from df2 should return, give a value of yes because it is contained in the string JUAN PABLO GIRON BELTRAN from df2. The same case would apply for the other values. In the end I would like to have something like this:

df3
                   name val
1            JUAN GIRON yes
2            GINA OLEAS  no
3 JUAN FERNANDO ELIZAGA yes
4          MARCO TORRES  no
5   JUAN PABLO GONZALEZ yes
6            IRMA GOMEZ yes

How can I reach that result? I have tried with grepl() concatenating the strings using | but it is not working because some values are returning a yes match when there is not match.

Also, as data can be large, I would like to have a solution with dplyr because the comparison is by row so it can be slow. Or any fast solution is welcome. Many thanks!

Data is next:

#df1
df1 <- structure(list(name = c("JUAN GIRON", "GINA OLEAS", "JUAN FERNANDO ELIZAGA", 
"MARCO TORRES", "JUAN PABLO GONZALEZ", "IRMA GOMEZ")), row.names = c(NA, 
-6L), class = "data.frame")

#df2
df2 <- structure(list(name = c("JUANA MARQUEZ", "FERNANDO ELIZAGA", 
"IRMA GOMEZ", "PABLO GONZALEZ", "GINA LUCIO", "MARK TORRES", 
"LETICIA BLACIO", "JUAN PABLO GIRON BELTRAN"), val = 1:8), row.names = c(NA, 
-8L), class = "data.frame")

CodePudding user response：

Perhaps we can do like this

df1 %>%
    mutate(val = c("no", "yes")[1   (rowSums(
        outer(
            strsplit(name, "\\s "),
            strsplit(df2$name, "\\s "),
            Vectorize(function(x, y) all(x %in% y) | all(y %in% x))
        )
    ) > 0)])

which gives

                   name val
1            JUAN GIRON yes
2            GINA OLEAS  no
3 JUAN FERNANDO ELIZAGA yes
4          MARCO TORRES  no
5   JUAN PABLO GONZALEZ yes
6            IRMA GOMEZ yes

CodePudding user response：

Here's an approach that uses a regex pattern and handles names that are either length 2 or 3. There's room for improvement, and I'd love to read other answers to this question.

# Input
a <- strsplit(df2$name, " ")
# Output
b <- c()

# Define regex pattern
for(i in 1:length(a)){
  if(length(a[[i]]) == 3){
    temp <- paste0(
        a[[i]][1], " ", a[[i]][2], "|",
        a[[i]][1], " ", a[[i]][3], "|",
        a[[i]][2], " ", a[[i]][3])
  } else if(length(a[[i]] == 2)){
    temp <- paste(a[[i]], collapse = " ")
  } else {
    stop("Length of split name was not 2 or 3")
  }
  b <- c(b, temp)
}

df1$val <- grepl(paste(b, collapse = "|"), df1$name)

Alternatively, after defining b using the loop above:

library(dplyr)
patt <- paste(b, collapse = "|")
df1 %>%
    mutate(val = grepl(patt, name))

Result:

> df1
                   name    val
1            JUAN GIRON    TRUE
2            GINA OLEAS   FALSE
3 JUAN FERNANDO ELIZAGA    TRUE
4          MARCO TORRES   FALSE
5   JUAN PABLO GONZALEZ    TRUE
6            IRMA GOMEZ    TRUE

CodePudding user response：

df2_flat <- df2$name |> stringr::str_split(" ") |> purrr::flatten()                                                                                                                                                                                                 

df1 |> 
    mutate(splitnames = stringr::str_split(name, " ")) |> 
    rowwise() |>
    mutate(val = all(splitnames %in% df2_flat)) |>
    select(-splitnames)

Produces the list you specified