I found this post, but it's not quite my scenario. Is there an efficient way of comparing two data frames
The reason I want to compare two dataframes is that I am looking for changes that may have occured. (Think "audit"). The two frames have exactly the same column layout, just that one may have more or less rows than the other, and values may have changed in two of the columns.
ID Period Price OtherID
0 10100001 2019-10-01 8995.00 ABBD
1 38730001 2019-11-01 38227.71 EIRU
2 30100453 2019-12-01 22307.00 DDHF
3 92835543 2020-01-01 2310.00 DDGF
4 66453422 2020-02-01 12113.29 DGFH
ID Period Price OtherID
0 10100001 2019-10-01 5.00 ABBD
1 38730001 2019-11-01 38227.71 XXXX
2 30100453 2019-12-01 22307.00 DDHF
3 92835543 2020-01-01 2310.00 DDGF
4 66453422 2020-02-01 12113.29 DGFH
5 22223422 2020-02-01 123.29 HHYG
The two columns that I am suspicious about are "Price" and "OtherID". I want to find any changes in the Price or OtherID.
I see three scenarios:
1. row has been added
2. row has been deleted
3. row doesn't match
I can iterate through it all, but I'm wondering if there is some pandas magic out there that will do it in one fell swoop.
The output I seek is something like this:
ID Period Analysis
0 10100001 2019-10-01 Changed
1 38730001 2019-11-01 Changed
3 22223422 2020-02-01 New
And just to be clear, ID by itself is not unique. And Period by itself is also not unique. The two together are.
CodePudding user response:
Use compare
after merge
your 2 dataframes on ID
and Period
:
out = pd.merge(df1, df2, on=['ID', 'Period'], how='outer',
suffixes=('_df1', '_df2')).set_index(['ID', 'Period'])
out.columns = pd.MultiIndex.from_tuples(out.columns.str.split('_').map(tuple)) \
.swaplevel()
out = out['df1'].compare(out['df2'])
Output:
>>> out
Price OtherID
self other self other
ID Period
10100001 2019-10-01 8995.0 5.00 NaN NaN
38730001 2019-11-01 NaN NaN EIRU XXXX
22223422 2020-02-01 NaN 123.29 NaN HHYG
# Summarize
>>> out.swaplevel(axis=1)['self'].isna().all(axis=1) \
.replace({True: 'New', False: 'Changed'})
ID Period
10100001 2019-10-01 Changed
38730001 2019-11-01 Changed
22223422 2020-02-01 New
dtype: object
If you append keep_shape=True
parameter:
>>> out['df1'].compare(out['df2'], keep_shape=True)
Price OtherID
self other self other
ID Period
10100001 2019-10-01 8995.0 5.00 NaN NaN
38730001 2019-11-01 NaN NaN EIRU XXXX
30100453 2019-12-01 NaN NaN NaN NaN
92835543 2020-01-01 NaN NaN NaN NaN
66453422 2020-02-01 NaN NaN NaN NaN
22223422 2020-02-01 NaN 123.29 NaN HHYG
CodePudding user response:
I think you should try apply
Make a separate dataframe with desired headers
df_update = pd.DataFrame(updated_data,columns =['ID','PERIOD'])
And then apply this function
result = df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
It worked for me and I generate the updated entries dataframe through this. Let me know if its your desired solution