I have a large dataset where I want to filter out rows with matching column values in another dataframe that is of a different length.
As a simple example:
df1
A B date
1 3 3-10-2022
1 2 3-10-2022
2 2 3-5-2022
3 NA 4-5-2022
3 2 4-5-2022
4 NA 4-5-2022
df2
A date2
1 3-10-2022
2 4-10-2022
3 4-5-2022
The goal is to exclude rows in df1 where column values both A and date match e.g. df1$A = df2$A AND df1$date = df2$date2 such that my new data frame is:
Desired results
df3
A B date
2 2 3-5-2022
4 NA 4-5-2022
I have tried the following but have found that my results do not appropriately exclude rows. I also get the error message of "longer object length is not a multiple of shorter object length" and am wondering if this is the issue.
df3 <- df1[!(df1$A == df2$A & df1$date == df2$date),]
Incorrect results:
df3
A B date
1 2 3-10-2022
2 2 3-5-2022
3 NA 4-5-2022
3 2 4-5-2022
4 NA 4-5-2022
The issue appears to be with rows where A includes duplicated value and/or the row contains an NA value, it is incorrectly retained. Can you please advise?
CodePudding user response:
How about this? I just removed "!".
A<-c(1,2,3)
B<-c(3,NA,4)
date<-c("3-10-2022","3-5-2022","4-5-2022")
(df1<-data.frame(A,B,date))
date2<-c("3-10-2022","4-10-2022","4-5-2022")
(df2<-data.frame(A,date2))
(df3 <- df1[(df1$A == df2$A & df1$date == df2$date),])
# A B date
# 1 1 3 3-10-2022
# 3 3 4 4-5-2022
CodePudding user response:
This is a classical case for an anti join:
df1<- data.frame(A= c(1,1,2,3,3,4),
B= c(3,2,2,NA, 2, NA),
date= c("2022-10-03",
"2022-10-03",
"2022-05-03",
rep("2022-05-04",3))
)
df2 <- data.frame(A= 1:3,
date= c("2022-10-03",
"2022-10-04",
"2022-05-04"))
require(dplyr)
df1 %>% anti_join(df2, by= c("A", "date"))
CodePudding user response:
OK. Try "%in%" instead of "==".
(df3 <- df1[!(df1$A %in% df2$A & df1$date %in% df2$date),])
A B date
3 2 2 3-5-2022
6 4 NA 4-5-2022
Because df2$A is a vector.