Home > Back-end >  How to identify rows present in only one of the two datasets by comparing two of the common columns?
How to identify rows present in only one of the two datasets by comparing two of the common columns?

Time:10-29

I have two dataframes with same column names, examples below.

>dataframe1
   Company_name   Transaction_Code  Sum
1:    First             2000        234
2:    First             3000        562
3:    First             4000        105
4:   Second             8888        740
5:    Third             9000        325
6:    Third             4000        145
7:     BBB              1000         28
8:     BBB              3535        100
>dataframe2
   Company_name   Transaction_Code  Sum
1:    First             2000        340
2:    First             3000        620
3:    First             4000        050
4:   Second             8888        400
5:    Third             9000        250
6:    Third             4000        450
7:     BBB              1000         27

I am trying to check entries by the values of the first two columns to see what entries from dataframe1 are missing in dataframe2. As shown, dataframe1 has entry #8 which is missing from dataframe2. I have seen the dplyr::anti.join solution for such tasks with one condition/column, but it doesn't seem to work when I need to judge entries by values in two columns.

P.S. I did not include any reproducible example, since I saw no point. I am far from an expert in R or general coding, so the question could be lacking in one way or another, sorry.

CodePudding user response:

setdiff() might be what you're looking for :

df1 <- data.frame(company = c("first","first","first","second","third","third","BBB","BBB"),
                  transac = c(2000,3000,4000,8888,9000,4000,1000,3535),
                  sum=c(234,562,105,740,325,145,28,100))

df2 <- data.frame(company = c("first","first","first","second","third","third","BBB"),
                  transac = c(2000,3000,4000,8888,9000,4000,1000),
                  sum=c(340,620,050,400,250,450,27))


setdiff(df1[,1:2],df2[,1:2])

returns

 company transac
1     BBB    3535

CodePudding user response:

This can be done by anti_join() if you specify the columns you want to use for the join.

library(dplyr)
library(tibble)
dataframe1 = tribble(
  ~Company_name,  ~Transaction_Code,  ~Sum,
     "First",             2000,        234,
     "First",             3000,        562,
     "First",             4000,        105,
     "Second",            8888,        740,
     "Third",             9000,        325,
     "Third",             4000,        145,
     "BBB",               1000,         28,
     "BBB",               3535,        100
)

dataframe2 = tribble(
  ~Company_name,   ~Transaction_Code,  ~Sum,
      "First",             2000,        340,
      "First",             3000,        620,
      "First",             4000,        050,
      "Second",            8888,        400,
      "Third",             9000,        250,
      "Third",             4000,        450,
      "BBB",               1000,        27
)

anti_join(dataframe1, dataframe2, by = c("Company_name", "Transaction_Code"))
  • Related