How to identify rows present in only one of the two datasets by comparing two of the common columns?-CodePudding

I have two dataframes with same column names, examples below.

>dataframe1
   Company_name   Transaction_Code  Sum
1:    First             2000        234
2:    First             3000        562
3:    First             4000        105
4:   Second             8888        740
5:    Third             9000        325
6:    Third             4000        145
7:     BBB              1000         28
8:     BBB              3535        100

>dataframe2
   Company_name   Transaction_Code  Sum
1:    First             2000        340
2:    First             3000        620
3:    First             4000        050
4:   Second             8888        400
5:    Third             9000        250
6:    Third             4000        450
7:     BBB              1000         27

I am trying to check entries by the values of the first two columns to see what entries from dataframe1 are missing in dataframe2. As shown, dataframe1 has entry #8 which is missing from dataframe2. I have seen the dplyr::anti.join solution for such tasks with one condition/column, but it doesn't seem to work when I need to judge entries by values in two columns.

P.S. I did not include any reproducible example, since I saw no point. I am far from an expert in R or general coding, so the question could be lacking in one way or another, sorry.

CodePudding user response：

setdiff() might be what you're looking for :

df1 <- data.frame(company = c("first","first","first","second","third","third","BBB","BBB"),
                  transac = c(2000,3000,4000,8888,9000,4000,1000,3535),
                  sum=c(234,562,105,740,325,145,28,100))

df2 <- data.frame(company = c("first","first","first","second","third","third","BBB"),
                  transac = c(2000,3000,4000,8888,9000,4000,1000),
                  sum=c(340,620,050,400,250,450,27))


setdiff(df1[,1:2],df2[,1:2])

returns

 company transac
1     BBB    3535

CodePudding user response：

This can be done by anti_join() if you specify the columns you want to use for the join.

library(dplyr)
library(tibble)
dataframe1 = tribble(
  ~Company_name,  ~Transaction_Code,  ~Sum,
     "First",             2000,        234,
     "First",             3000,        562,
     "First",             4000,        105,
     "Second",            8888,        740,
     "Third",             9000,        325,
     "Third",             4000,        145,
     "BBB",               1000,         28,
     "BBB",               3535,        100
)

dataframe2 = tribble(
  ~Company_name,   ~Transaction_Code,  ~Sum,
      "First",             2000,        340,
      "First",             3000,        620,
      "First",             4000,        050,
      "Second",            8888,        400,
      "Third",             9000,        250,
      "Third",             4000,        450,
      "BBB",               1000,        27
)

anti_join(dataframe1, dataframe2, by = c("Company_name", "Transaction_Code"))