Home > Blockchain >  Marking records that don't match between two dataframes
Marking records that don't match between two dataframes

Time:10-07

I have a benchmark dataframe:

my_id    parent_id    attribute_1    attribute_2     attribute_3       attribute_4
  ABC          DEF             A-          378.8          Accept             False
  ABS          DES             A-          388.8          Accept             False
  ABB          DEG             A           908.8          Decline             True
  ABB          DEG             B-          378.8          Accept             False
  APP          DRE             C-          370.8          Accept              True

And a dataframe:

my_id    parent_id    Attribute_1     attribute2           attr_3        attribute_5
  ABC          DEF             A-          478.8          Decline              StRing
  ABS          DES             A-          388.8          Accept               String
  ABB          DEG             A           908.8          Accept               StrIng
  ABB          DEG             C-          378.8          Accept               String
  APP          DRE             C-          370.8          Accept               STring

As you can see, some mistakes appear every now and then in attribute_1, attribute_2 or attribute_3 (the columns are named differently but they are supposed to contain the same thing).

How do I mark the faulty records when I check if, for each row, these three attributes are exactly the same as in the benchmark? I expect an output similar to this:

faulty_rows = 

    my_id    parent_id    Attribute_1     attribute2           attr_3       faulty_attr 
      ABC          DEF             A-          478.8          Decline       [attribute2, attr_3]                  
      ABB          DEG             A           908.8          Accept        [attr_3]      
      ABB          DEG             C-          378.8          Accept        [Attribute_1]

What I did was to rename the columns and always join column by column, separately, it gives me an idea about what is wrong, but I want to check the entire row at the same time and also mark where the mistake is. Is that possible? A PySpark or Pandas solution is goo either way, I am interested in the logic.

CodePudding user response:

Might DeepDiff perhaps be a solution (assuming A refers to your first dict, and B to your second one)?

from deepdiff import DeepDiff

print(DeepDiff(A, B, ignore_order=False).pretty())

###resulting output:
###Value of root['attribute_1'][3] changed from "B-" to "C-".
###Value of root['attribute_2'][0] changed from 378.8 to 478.8.
###Value of root['attribute_3'][0] changed from "Accept" to "Decline".
###Value of root['attribute_3'][2] changed from "Decline" to "Accept"

CodePudding user response:

An approach may be using pandas.DataFrame.compare, which demands:

  1. The same column names and the same shape on both dataframes.

  2. The orders of the rows being consistent -- this can be achieved by preprocessing of the dataframes.

The code:

df1.iloc[:,0:5].compare(df2.iloc[:,0:5])

Output:

  attribute_1       attribute_2        attribute_3         
         self other        self  other        self    other
0         NaN   NaN       378.8  478.8      Accept  Decline
2         NaN   NaN         NaN    NaN     Decline   Accept
3          B-    C-         NaN    NaN         NaN      NaN
  • Related