Home > Blockchain >  How to compare 2 non-identical dataframes in python
How to compare 2 non-identical dataframes in python

Time:12-10

I have two dataframes with the same column order but different column names and different rows. df2 rows vary from df1 rows.

df1=     col_id  num  name
      0   1     3     linda
      1   2     4     James

df2=     id     no   name
      0   1     2    granpa
      1   2     6    linda
      2   3     7    sam

This is the output I need. Outputs rows with same, OLD and NEW values so the user can clearly see what changed between two dataframes:

 result   col_id        num              name   
      0   1             was 3| now 2    was linda| now granpa  
      1   2             was 4| now 6    was James| now linda
      2   was  | now 3  was  | now 7    was      | now sam

CodePudding user response:

If I understand correctly, you want something like this:

new_df = df1.drop(['name', 'num'], axis=1).merge(df2.rename({'id': 'col_id'}, axis=1), how='outer')

Output:

>>> new_df
   col_id  no    name
0       1   2  granpa
1       2   6   linda
2       3   7     sam

CodePudding user response:

Since your goal is just to compare differences, use DataFrame.compare instead of aggregating into strings.

However, note that DataFrame.compare

can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames

So we just need to align the row/column indexes, either via merge or reindex.


Align via merge

  1. Outer-merge the two dfs:

    merged = df1.merge(df2, how='outer', left_on='col_id', right_on='id')
    #    col_id   num  name_x  id  no  name_y
    # 0       1     3   linda   1   2  granpa
    # 1       2     4   james   2   6   linda
    # 2    <NA>  <NA>     NaN   3   7     sam
    
  2. Divide the merged frame into left/right frames and align their columns with set_axis:

    cols = df1.columns
    left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
    #    col_id   num   name
    # 0       1     3  linda
    # 1       2     4  james
    # 2    <NA>  <NA>    NaN
    
    right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)
    #    col_id  num    name
    # 0       1    2  granpa
    # 1       2    6   linda
    # 2       3    7     sam
    
  3. compare the aligned left/right frames (use keep_equal=True to show equal cells):

    left.compare(right, keep_shape=True, keep_equal=True)
    #        col_id         num          name
    #    self other  self other   self  other
    # 0     1     1     3     2  linda granpa
    # 1     2     2     4     6  james  linda
    # 2  <NA>     3  <NA>     7    NaN    sam
    
    left.compare(right, keep_shape=True)
    #        col_id         num          name
    #    self other  self other   self  other
    # 0  <NA>  <NA>     3     2  linda granpa
    # 1  <NA>  <NA>     4     6  james  linda
    # 2  <NA>  <NA>  <NA>  <NA>    NaN    sam
    

Align via reindex

If you are 100% sure that one df is a subset of the other, then reindex the subsetted rows.

In your example, df1 is a subset of df2, so reindex df1:

df1.assign(id=df1.col_id)          # copy col_id (we need original col_id after reindexing)
   .set_index('id')                # set index to copied id
   .reindex(df2.id)                # reindex on df2's id
   .reset_index(drop=True)         # remove copied id
   .set_axis(df2.columns, axis=1)  # align column names
   .compare(df2, keep_equal=True, keep_shape=True)

#        col_id         num          name
#    self other  self other   self  other
# 0     1     1     3     2  linda granpa
# 1     2     2     4     6  james  linda
# 2  <NA>     3  <NA>     7    NaN    sam

Nullable integers

Normally int cannot mix with nan, so pandas converts to float. If you want the int values to stay int (like the examples above):

  • either convert the int columns to nullable integers with astype('Int64') (capital I)
  • or convert them to astype(object) (not recommended)
  • Related