I am trying to compare two CSV files and print the differences in Python as a custom text file.
For example:
CSV 1:
Id, Customer, Status, Date
01, ABC, Good, Mar 2023
02, BAC, Good, Feb 2024
03, CBA, Bad, Apr 2022
CSV 2:
Id, Customer, Status, Date
01, ABC, Bad, Mar 2023
02, BAC, Good, Feb 2024
03, CBA, Good, Apr 2024
Expected Output:
Id 01 Status is changed to Bad
Id 03 Status is changed to Good
Id 03 Date changed is to Apr 2024
Any suggestion/idea to code for the expected output. Thanks
CodePudding user response:
I take it from data frames you are using pandas?
Pandas text export is easy enough as per the link to https://stackoverflow.com/a/31247247/16367225 posted in the comment to your question.
Get whatever result you want in a one-row-per-one-df format (lookup pandas merge if you are unclear on how joins between frames work), and then do something like this: How to iterate over rows in a DataFrame in Pandas which describes how to iterate over rows.
Make a variable for what changed, e.g, StatusChange.
Then iterate over it to make your output
Then that particular row might look something like:
f.writelines(f'Id {row['id']} Status is changed to {row['status']})
If you combine that sort of iterate over one df approach with the text output from the comment you should be right
CodePudding user response:
There is a method in pandas which could help you in doing that:
df.compare
will compare 2 different dataframes and return which values changed in each column for the data records
data1 = [[1, 'ABC', 'Good', 'Mar 2023'], [2, 'BAC', 'Good', 'Feb 2024'], [3, 'CBA', 'Bad', 'Apr 2022']]
df1 = pd.DataFrame(data1, columns=['Id', 'Customer', 'Status', 'Date'])
data2 = [[1, 'ABC', 'Bad', 'Mar 2023'], [2, 'BAC', 'Good', 'Feb 2024'], [3, 'CBA', 'Good', 'Apr 2024']]
df2 = pd.DataFrame(data2, columns=['Id', 'Customer', 'Status', 'Date'])
print("Dataframe difference -- \n")
print(df1.compare(df2))
print("Dataframe difference keeping equal values -- \n")
print(df1.compare(df2, keep_equal=True))
Output:
Dataframe difference --
Status Date
self other self other
0 Good Bad NaN NaN
2 Bad Good Apr 2022 Apr 2024
Dataframe difference keeping equal values --
Status Date
self other self other
0 Good Bad Mar 2023 Mar 2023
2 Bad Good Apr 2022 Apr 2024