I am very new to Python so would appreciate any help that I can get.
Problem Statement: Given two similar csv files, assert whether they are the same or not. If not, return the differences in each column. The assert condition should be checked for every column separately. The rows have a unique id. This id may/can not be sorted, Eg- the similar row with id=1 in csv1 might be the last row in csv2 with id=1.
To Do:
- Iterate over rows in csv1
- Find out that row's id and then find that id in csv2.
- Now, compare the data of these two rows and return the differences for each column separately.
Please help me with a solution to this problem or how i should go about it.
CodePudding user response:
You can compare 2 datasets via datacompy: https://capitalone.github.io/datacompy/
from io import StringIO
import pandas as pd
import datacompy
data1 = """acct_id,dollar_amt,name,float_fld,date_fld
10000001234,123.45,George Maharis,14530.1555,2017-01-01
10000001235,0.45,Michael Bluth,1,2017-01-01
10000001236,1345,George Bluth,,2017-01-01
10000001237,123456,Bob Loblaw,345.12,2017-01-01
10000001239,1.05,Lucille Bluth,,2017-01-01
"""
data2 = """acct_id,dollar_amt,name,float_fld
10000001234,123.4,George Michael Bluth,14530.155
10000001235,0.45,Michael Bluth,
10000001236,1345,George Bluth,1
10000001237,123456,Robert Loblaw,345.12
10000001238,1.05,Loose Seal Bluth,111
"""
df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))
compare = datacompy.Compare(
df1,
df2,
join_columns='acct_id', #You can also specify a list of columns
)
compare.matches(ignore_extra_columns=False)
# False
# This method prints out a human-readable report summarizing and sampling differences
print(compare.report())
CodePudding user response:
It would have been better if you provided the dataframe.
Let's assume your dfs are df
and df_2
, and your row id is in the id
column. You can compare the columns of the corresponding rows based on the id
as:
df.apply(lambda x: df_2[df_2['id'] == x['id']].squeeze() == x, axis=1)