I am using R
to work with some dataframes. My issue is related on how to check if values in a variable in a first dataframe match with values in another dataframe. The match is very different to those like merge
or join
. I will introduce my dataframes (dput()
at end):
My first dataframe is df1
. It contains the variable name
which I want to contrast with other variable in a second dataframe. It looks like this:
df1
name
1 JUAN GIRON
2 GINA OLEAS
3 JUAN FERNANDO ELIZAGA
4 MARCO TORRES
5 JUAN PABLO GONZALEZ
6 IRMA GOMEZ
The second dataframe is df2
. It also contains a variable name
which will be used to the contrast with name
from df1
. It looks like this (In a real situation df2
can be very large with more than 1000 rows):
df2
name val
1 JUANA MARQUEZ 1
2 FERNANDO ELIZAGA 2
3 IRMA GOMEZ 3
4 PABLO GONZALEZ 4
5 GINA LUCIO 5
6 MARK TORRES 6
7 LETICIA BLACIO 7
8 JUAN PABLO GIRON BELTRAN 8
I am looking for a way to check if every row of df1
for name
variable is contained or match with any value for name
in df2
. For example, the value JUAN GIRON
after checking with name
from df2
should return, give a value of yes
because it is contained in the string JUAN PABLO GIRON BELTRAN
from df2
. The same case would apply for the other values. In the end I would like to have something like this:
df3
name val
1 JUAN GIRON yes
2 GINA OLEAS no
3 JUAN FERNANDO ELIZAGA yes
4 MARCO TORRES no
5 JUAN PABLO GONZALEZ yes
6 IRMA GOMEZ yes
How can I reach that result? I have tried with grepl()
concatenating the strings using |
but it is not working because some values are returning a yes
match when there is not match.
Also, as data can be large, I would like to have a solution with dplyr
because the comparison is by row so it can be slow. Or any fast solution is welcome. Many thanks!
Data is next:
#df1
df1 <- structure(list(name = c("JUAN GIRON", "GINA OLEAS", "JUAN FERNANDO ELIZAGA",
"MARCO TORRES", "JUAN PABLO GONZALEZ", "IRMA GOMEZ")), row.names = c(NA,
-6L), class = "data.frame")
#df2
df2 <- structure(list(name = c("JUANA MARQUEZ", "FERNANDO ELIZAGA",
"IRMA GOMEZ", "PABLO GONZALEZ", "GINA LUCIO", "MARK TORRES",
"LETICIA BLACIO", "JUAN PABLO GIRON BELTRAN"), val = 1:8), row.names = c(NA,
-8L), class = "data.frame")
CodePudding user response:
Perhaps we can do like this
df1 %>%
mutate(val = c("no", "yes")[1 (rowSums(
outer(
strsplit(name, "\\s "),
strsplit(df2$name, "\\s "),
Vectorize(function(x, y) all(x %in% y) | all(y %in% x))
)
) > 0)])
which gives
name val
1 JUAN GIRON yes
2 GINA OLEAS no
3 JUAN FERNANDO ELIZAGA yes
4 MARCO TORRES no
5 JUAN PABLO GONZALEZ yes
6 IRMA GOMEZ yes
CodePudding user response:
Here's an approach that uses a regex pattern and handles names that are either length 2 or 3. There's room for improvement, and I'd love to read other answers to this question.
# Input
a <- strsplit(df2$name, " ")
# Output
b <- c()
# Define regex pattern
for(i in 1:length(a)){
if(length(a[[i]]) == 3){
temp <- paste0(
a[[i]][1], " ", a[[i]][2], "|",
a[[i]][1], " ", a[[i]][3], "|",
a[[i]][2], " ", a[[i]][3])
} else if(length(a[[i]] == 2)){
temp <- paste(a[[i]], collapse = " ")
} else {
stop("Length of split name was not 2 or 3")
}
b <- c(b, temp)
}
df1$val <- grepl(paste(b, collapse = "|"), df1$name)
Alternatively, after defining b
using the loop above:
library(dplyr)
patt <- paste(b, collapse = "|")
df1 %>%
mutate(val = grepl(patt, name))
Result:
> df1
name val
1 JUAN GIRON TRUE
2 GINA OLEAS FALSE
3 JUAN FERNANDO ELIZAGA TRUE
4 MARCO TORRES FALSE
5 JUAN PABLO GONZALEZ TRUE
6 IRMA GOMEZ TRUE
CodePudding user response:
df2_flat <- df2$name |> stringr::str_split(" ") |> purrr::flatten()
df1 |>
mutate(splitnames = stringr::str_split(name, " ")) |>
rowwise() |>
mutate(val = all(splitnames %in% df2_flat)) |>
select(-splitnames)
Produces the list you specified