Home > front end >  How to match multiple columns without merge?
How to match multiple columns without merge?

Time:10-26

I have those two df's:

ID1 <- c("TRZ00897", "AAR9832", "NZU44447683209", "sxc89898989M", "RSU765th89", "FFF")
Date1 <- c("2022-08-21","2022-03-22","2022-09-24",  "2022-09-21", "2022-09-22", "2022-09-22")
Data1 <- data.frame(ID1,Date1)

ID <- c("RSU765th89",  "NZU44447683209", "AAR9832", "TRZ00897","ERD895655", "FFF", "IUHG0" )
Date <- c("2022-09-22","2022-09-21", "2022-03-22", "2022-08-21", "2022-09-21", "2022-09-22", "2022-09-22" )
Data2 <- data.frame(ID,Date)

I tried to get exact matches. An exact match is if ID and Date are the same in both df's, for example: "TRZ00897" "2022-08-21" is an exact match, because it is present in both df's

With the following line of code:

match(Data1$ID1, Data2$ID) == match(Data1$Date1, Data2$Date)

the output is:

TRUE  TRUE    NA    NA  TRUE FALSE

Obviously the last one should not be FALSE because "FFF" "2022-09-22" is in both df. The reason why it is FALSE is, that the Date"2022-09-22" occurred already in Data2 at index position 1.

match(Data1$ID1, Data2$ID)
  4  3  2 NA  1  6

match(Data1$Date1, Data2$Date)
  4  3 NA  2  1  1

So at the end, there is index position 6 and 1 which is not equal --> FALSE

How can I change this? Which function should I use to get the correct answer.

Note, I don't need to merge or join etc. I'm really looking for a function that can detect those patterns.

CodePudding user response:

Combine the columns then match:

match(paste(Data1$ID1, Data1$Date1), paste(Data2$ID, Data2$Date))
# [1]  4  3 NA NA  1  6

To get logical outut use %in%:

paste(Data1$ID1, Data1$Date1) %in% paste(Data2$ID, Data2$Date)
# [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE

CodePudding user response:

Try match with asplit (since you have different column names for two dataframes, I have to manually remove the names using unname, which can be avoided if both of them have the same names)

> match(asplit(unname(Data1), 1), asplit(unname(Data2), 1))
[1]  4  3 NA NA  1  6

Another option that is memory-expensive option is using interaction

> match(interaction(Data1), interaction(Data2))
[1]  4  3 NA NA  1  6

CodePudding user response:

With mapply and %in%:

apply(mapply(`%in%`, Data1, Data2), 1, all)
[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE

rowSums(mapply(`%in%`, Data1, Data2)) == ncol(Data1)

Edit; for a subset of columns:

idx <- c(1, 2)
apply(mapply(`%in%`, Data1[idx], Data2[idx]), 1, all)
#[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE
  • Related