Home > Software design >  Is there an efficient way to compare two dataframes of different sizes?
Is there an efficient way to compare two dataframes of different sizes?

Time:11-08

I found this post, but it's not quite my scenario. Is there an efficient way of comparing two data frames

The reason I want to compare two dataframes is that I am looking for changes that may have occured. (Think "audit"). The two frames have exactly the same column layout, just that one may have more or less rows than the other, and values may have changed in two of the columns.

       ID       Period          Price       OtherID
0      10100001 2019-10-01     8995.00         ABBD
1      38730001 2019-11-01    38227.71         EIRU
2      30100453 2019-12-01    22307.00         DDHF
3      92835543 2020-01-01     2310.00         DDGF
4      66453422 2020-02-01    12113.29         DGFH


       ID       Period          Price       OtherID
0      10100001 2019-10-01        5.00         ABBD
1      38730001 2019-11-01    38227.71         XXXX
2      30100453 2019-12-01    22307.00         DDHF
3      92835543 2020-01-01     2310.00         DDGF
4      66453422 2020-02-01    12113.29         DGFH
5      22223422 2020-02-01      123.29         HHYG

The two columns that I am suspicious about are "Price" and "OtherID". I want to find any changes in the Price or OtherID.

I see three scenarios:

1. row has been added
2. row has been deleted
3. row doesn't match

I can iterate through it all, but I'm wondering if there is some pandas magic out there that will do it in one fell swoop.

The output I seek is something like this:

       ID       Period         Analysis
0      10100001 2019-10-01     Changed
1      38730001 2019-11-01     Changed
3      22223422 2020-02-01     New

And just to be clear, ID by itself is not unique. And Period by itself is also not unique. The two together are.

CodePudding user response:

Use compare after merge your 2 dataframes on ID and Period:

out = pd.merge(df1, df2, on=['ID', 'Period'], how='outer',
               suffixes=('_df1', '_df2')).set_index(['ID', 'Period'])

out.columns = pd.MultiIndex.from_tuples(out.columns.str.split('_').map(tuple)) \
                           .swaplevel()

out = out['df1'].compare(out['df2'])

Output:

>>> out
                      Price         OtherID      
                       self   other    self other
ID       Period                                  
10100001 2019-10-01  8995.0    5.00     NaN   NaN
38730001 2019-11-01     NaN     NaN    EIRU  XXXX
22223422 2020-02-01     NaN  123.29     NaN  HHYG

# Summarize
>>> out.swaplevel(axis=1)['self'].isna().all(axis=1) \
       .replace({True: 'New', False: 'Changed'})
ID        Period    
10100001  2019-10-01    Changed
38730001  2019-11-01    Changed
22223422  2020-02-01        New
dtype: object

If you append keep_shape=True parameter:

>>> out['df1'].compare(out['df2'], keep_shape=True)
                      Price         OtherID      
                       self   other    self other
ID       Period                                  
10100001 2019-10-01  8995.0    5.00     NaN   NaN
38730001 2019-11-01     NaN     NaN    EIRU  XXXX
30100453 2019-12-01     NaN     NaN     NaN   NaN
92835543 2020-01-01     NaN     NaN     NaN   NaN
66453422 2020-02-01     NaN     NaN     NaN   NaN
22223422 2020-02-01     NaN  123.29     NaN  HHYG

CodePudding user response:

I think you should try apply

Make a separate dataframe with desired headers

df_update = pd.DataFrame(updated_data,columns =['ID','PERIOD'])

And then apply this function

result = df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]

It worked for me and I generate the updated entries dataframe through this. Let me know if its your desired solution

  • Related