I have a benchmark dataframe:
my_id parent_id attribute_1 attribute_2 attribute_3 attribute_4
ABC DEF A- 378.8 Accept False
ABS DES A- 388.8 Accept False
ABB DEG A 908.8 Decline True
ABB DEG B- 378.8 Accept False
APP DRE C- 370.8 Accept True
And a dataframe:
my_id parent_id Attribute_1 attribute2 attr_3 attribute_5
ABC DEF A- 478.8 Decline StRing
ABS DES A- 388.8 Accept String
ABB DEG A 908.8 Accept StrIng
ABB DEG C- 378.8 Accept String
APP DRE C- 370.8 Accept STring
As you can see, some mistakes appear every now and then in attribute_1, attribute_2 or attribute_3 (the columns are named differently but they are supposed to contain the same thing).
How do I mark the faulty records when I check if, for each row, these three attributes are exactly the same as in the benchmark? I expect an output similar to this:
faulty_rows =
my_id parent_id Attribute_1 attribute2 attr_3 faulty_attr
ABC DEF A- 478.8 Decline [attribute2, attr_3]
ABB DEG A 908.8 Accept [attr_3]
ABB DEG C- 378.8 Accept [Attribute_1]
What I did was to rename the columns and always join column by column, separately, it gives me an idea about what is wrong, but I want to check the entire row at the same time and also mark where the mistake is. Is that possible? A PySpark or Pandas solution is goo either way, I am interested in the logic.
CodePudding user response:
Might DeepDiff
perhaps be a solution (assuming A
refers to your first dict, and B
to your second one)?
from deepdiff import DeepDiff
print(DeepDiff(A, B, ignore_order=False).pretty())
###resulting output:
###Value of root['attribute_1'][3] changed from "B-" to "C-".
###Value of root['attribute_2'][0] changed from 378.8 to 478.8.
###Value of root['attribute_3'][0] changed from "Accept" to "Decline".
###Value of root['attribute_3'][2] changed from "Decline" to "Accept"
CodePudding user response:
An approach may be using pandas.DataFrame.compare
, which demands:
The same column names and the same shape on both dataframes.
The orders of the rows being consistent -- this can be achieved by preprocessing of the dataframes.
The code:
df1.iloc[:,0:5].compare(df2.iloc[:,0:5])
Output:
attribute_1 attribute_2 attribute_3
self other self other self other
0 NaN NaN 378.8 478.8 Accept Decline
2 NaN NaN NaN NaN Decline Accept
3 B- C- NaN NaN NaN NaN