I have 2 data frames that look something like this (it is a count table). data1 has a column called "Method_Final" that I would like to add into data2. I want to match it based on ONLY columns Method1, Method2, Method3 (the Col1, Col2, Col3, and Count columns don't need to match, but I want them to be brought into the final dataframe). If there is a match on those 3 columns, take the Method_Final from data1 and put it in data2. If there is no match, then make the value "Not Determined".
Note: data1 has rows that are not in data2. I would only like rows that are in data2 to be in my final table. Any rows that are in data1 that are not in data2 should be removed.
I have an example of what I'm looking for in the data frame below called data_final.
data1 <- data.frame("Col1" = c("ABC", "ABC", "EFG", "XYZ", "ZZZ"), "Col2" = c("AA",
"AA","AA", "BB", "AA"), "Col3" = c("Al", "B", "Al", "Al", "B"), "Method1" =
c("Sample", "Dry", "Sample", "Sample", "Dry"), "Method2" = c("Blank", "Not Blank",
"Blank", "Not Blank", "Not Blank"), "Method3" = c("Yes", "Yes", "No", "No", "No"),
"Count" = c(1, 4, 6, 2, 4), "Method_Final" = c("AAR", "ARG", "PCO", "YRG", "ZYX"))
data2 <- data.frame("Col1" = c("ABC", "ABC", "ABC", "EFG", "XYZ", "XYZ"), "Col2" =
c("AA", "AA","CC", "AA", "BB", "CC"), "Col3" = c("Al", "B", "C", "Al", "Al", "C"),
"Method1" = c("Sample", "Dry", "Sample", "Sample", "Dry", "Bucket"), "Method2" =
c("Blank", "Not Blank", "Blank", "Not Blank", "Not Blank", "Not Blank"), "Method3" =
c("Yes", "Yes", "Yes", "No", "No", "Yes"), "Count" = c(1, 4, 5, 6, 2, 1))
This is the data set I would like to end up with:
data_final <- data.frame("Col1" = c("ABC", "ABC", "ABC", "EFG", "XYZ", "XYZ"), "Col2"
= c("AA", "AA","CC", "AA", "BB", "CC"), "Col3" = c("Al", "B", "C", "Al", "Al", "C"),
"Method1" = c("Sample", "Dry", "Sample", "Sample", "Dry", "Bucket"), "Method2" =
c("Blank", "Not Blank", "Blank", "Not Blank", "Not Blank", "Not Blank"), "Method3" =
c("Yes", "Yes", "Yes", "No", "No", "Yes"), "Count" = c(1, 4, 5, 6, 2, 1),
"Method_Final" = c("AAR", "ARG", "AAR", "YRG", "Not Determined", "Not Determined"))
I've tried various joins using dplyr (left_join, right_join, etc.) and I can't figure it out.
Thank you so much!
CodePudding user response:
You can do:
library(tidyverse)
data2 %>%
left_join(data1 %>%
select(starts_with('Method')),
by = paste0('Method', 1:3)) %>%
mutate(Method_Final = if_else(is.na(Method_Final), 'Not determined', Method_Final))
which gives:
Col1 Col2 Col3 Method1 Method2 Method3 Count Method_Final
1 ABC AA Al Sample Blank Yes 1 AAR
2 ABC AA B Dry Not Blank Yes 4 ARG
3 ABC CC C Sample Blank Yes 5 AAR
4 EFG AA Al Sample Not Blank No 6 YRG
5 XYZ BB Al Dry Not Blank No 2 ZYX
6 XYZ CC C Bucket Not Blank Yes 1 Not determined
Note that this differs from your expected output for the fifth row. Can you please check, what the Method_Fianl value should be here? Since there is a value for it in data1, it shouldn‘t be 'Not determined'.
CodePudding user response:
You can try this. Combine the fields to match into one:
dm1 <- paste(data1$Method1, data1$Method2, data1$Method3, sep="|")
dm2 <- paste(data2$Method1, data2$Method2, data2$Method3, sep="|")
Now match the two:
m <- match(dm2, dm1)
# will return NA where not matching
Get the Method_Final
from data1
where it matches:
data2$Method_Final <- as.character(data1$Method_Final[m])
Where NA
, make it "Not Determined":
data2$Method_Final[is.na(data2$Method_Final)] <- "Not Determined"
Result is the same as @deschen's above.