Create new column based on matches in another table/merging-CodePudding

I have 2 data frames that look something like this (it is a count table). data1 has a column called "Method_Final" that I would like to add into data2. I want to match it based on ONLY columns Method1, Method2, Method3 (the Col1, Col2, Col3, and Count columns don't need to match, but I want them to be brought into the final dataframe). If there is a match on those 3 columns, take the Method_Final from data1 and put it in data2. If there is no match, then make the value "Not Determined".

Note: data1 has rows that are not in data2. I would only like rows that are in data2 to be in my final table. Any rows that are in data1 that are not in data2 should be removed.

I have an example of what I'm looking for in the data frame below called data_final.

data1 <- data.frame("Col1" = c("ABC", "ABC", "EFG", "XYZ", "ZZZ"), "Col2" = c("AA", 
"AA","AA", "BB", "AA"), "Col3" = c("Al", "B", "Al", "Al", "B"), "Method1" = 
c("Sample", "Dry", "Sample", "Sample", "Dry"), "Method2" = c("Blank", "Not Blank", 
"Blank", "Not Blank", "Not Blank"), "Method3" = c("Yes", "Yes", "No", "No", "No"), 
"Count" = c(1, 4, 6, 2, 4), "Method_Final" = c("AAR", "ARG", "PCO", "YRG", "ZYX"))

data2 <- data.frame("Col1" = c("ABC", "ABC", "ABC", "EFG", "XYZ", "XYZ"), "Col2" = 
c("AA", "AA","CC", "AA", "BB", "CC"), "Col3" = c("Al", "B", "C", "Al", "Al", "C"), 
"Method1" = c("Sample", "Dry", "Sample", "Sample", "Dry", "Bucket"), "Method2" = 
c("Blank", "Not Blank", "Blank", "Not Blank", "Not Blank", "Not Blank"), "Method3" = 
c("Yes", "Yes", "Yes", "No", "No", "Yes"), "Count" = c(1, 4, 5, 6, 2, 1))

This is the data set I would like to end up with:

data_final <- data.frame("Col1" = c("ABC", "ABC", "ABC", "EFG", "XYZ", "XYZ"), "Col2" 
= c("AA", "AA","CC", "AA", "BB", "CC"), "Col3" = c("Al", "B", "C", "Al", "Al", "C"), 
"Method1" = c("Sample", "Dry", "Sample", "Sample", "Dry", "Bucket"), "Method2" = 
c("Blank", "Not Blank", "Blank", "Not Blank", "Not Blank", "Not Blank"), "Method3" = 
c("Yes", "Yes", "Yes", "No", "No", "Yes"), "Count" = c(1, 4, 5, 6, 2, 1), 
"Method_Final" = c("AAR", "ARG", "AAR", "YRG", "Not Determined", "Not Determined"))

I've tried various joins using dplyr (left_join, right_join, etc.) and I can't figure it out.

Thank you so much!

CodePudding user response：

You can do:

library(tidyverse)
data2 %>%
  left_join(data1 %>%
              select(starts_with('Method')),
            by = paste0('Method', 1:3)) %>%
  mutate(Method_Final = if_else(is.na(Method_Final), 'Not determined', Method_Final))

which gives:

  Col1 Col2 Col3 Method1   Method2 Method3 Count   Method_Final
1  ABC   AA   Al  Sample     Blank     Yes     1            AAR
2  ABC   AA    B     Dry Not Blank     Yes     4            ARG
3  ABC   CC    C  Sample     Blank     Yes     5            AAR
4  EFG   AA   Al  Sample Not Blank      No     6            YRG
5  XYZ   BB   Al     Dry Not Blank      No     2            ZYX
6  XYZ   CC    C  Bucket Not Blank     Yes     1 Not determined

Note that this differs from your expected output for the fifth row. Can you please check, what the Method_Fianl value should be here? Since there is a value for it in data1, it shouldn‘t be 'Not determined'.

CodePudding user response：

You can try this. Combine the fields to match into one:

dm1 <- paste(data1$Method1, data1$Method2, data1$Method3, sep="|")
dm2 <- paste(data2$Method1, data2$Method2, data2$Method3, sep="|")

Now match the two:

m <- match(dm2, dm1)  
# will return NA where not matching

Get the Method_Final from data1 where it matches:

data2$Method_Final <- as.character(data1$Method_Final[m])

Where NA, make it "Not Determined":

data2$Method_Final[is.na(data2$Method_Final)] <- "Not Determined"

Result is the same as @deschen's above.