Home > Software engineering >  R filter or subset based on list of values in two rows
R filter or subset based on list of values in two rows

Time:10-07

I have a large dataset where I want to filter out rows with matching column values in another dataframe that is of a different length.

As a simple example:

df1
 A  B      date
 1  3 3-10-2022
 1  2 3-10-2022
 2  2  3-5-2022
 3 NA  4-5-2022
 3  2  4-5-2022
 4 NA  4-5-2022

df2
  A     date2
 1 3-10-2022
 2 4-10-2022
 3  4-5-2022

The goal is to exclude rows in df1 where column values both A and date match e.g. df1$A = df2$A AND df1$date = df2$date2 such that my new data frame is:

 Desired results
df3
 A  B      date
 2  2  3-5-2022
 4 NA  4-5-2022

I have tried the following but have found that my results do not appropriately exclude rows. I also get the error message of "longer object length is not a multiple of shorter object length" and am wondering if this is the issue.

df3 <- df1[!(df1$A == df2$A & df1$date == df2$date),]

Incorrect results:
df3
   A  B      date
  1  2 3-10-2022
  2  2  3-5-2022
  3 NA  4-5-2022
  3 2  4-5-2022
  4 NA  4-5-2022

The issue appears to be with rows where A includes duplicated value and/or the row contains an NA value, it is incorrectly retained. Can you please advise?

CodePudding user response:

How about this? I just removed "!".

A<-c(1,2,3)
B<-c(3,NA,4)
date<-c("3-10-2022","3-5-2022","4-5-2022")
(df1<-data.frame(A,B,date))
date2<-c("3-10-2022","4-10-2022","4-5-2022")
(df2<-data.frame(A,date2))

(df3 <- df1[(df1$A == df2$A & df1$date == df2$date),])

#   A B    date
# 1 1 3 3-10-2022
# 3 3 4  4-5-2022

CodePudding user response:

This is a classical case for an anti join:

df1<- data.frame(A= c(1,1,2,3,3,4),
           B= c(3,2,2,NA, 2, NA),
           date= c("2022-10-03",
                   "2022-10-03",
                   "2022-05-03",
                   rep("2022-05-04",3))
           )

df2 <- data.frame(A= 1:3,
                  date= c("2022-10-03",
                          "2022-10-04",
                          "2022-05-04"))

require(dplyr)

df1 %>% anti_join(df2, by= c("A", "date"))

CodePudding user response:

OK. Try "%in%" instead of "==".

(df3 <- df1[!(df1$A %in% df2$A & df1$date %in% df2$date),])

  A  B     date
3 2  2 3-5-2022
6 4 NA 4-5-2022

Because df2$A is a vector.

  • Related