How to compare 2 non-identical dataframes in python-CodePudding

I have two dataframes with the same column order but different column names and different rows. df2 rows vary from df1 rows.

df1=     col_id  num  name
      0   1     3     linda
      1   2     4     James

df2=     id     no   name
      0   1     2    granpa
      1   2     6    linda
      2   3     7    sam

This is the output I need. Outputs rows with same, OLD and NEW values so the user can clearly see what changed between two dataframes:

 result   col_id        num              name   
      0   1             was 3| now 2    was linda| now granpa  
      1   2             was 4| now 6    was James| now linda
      2   was  | now 3  was  | now 7    was      | now sam

CodePudding user response：

If I understand correctly, you want something like this:

new_df = df1.drop(['name', 'num'], axis=1).merge(df2.rename({'id': 'col_id'}, axis=1), how='outer')

Output:

>>> new_df
   col_id  no    name
0       1   2  granpa
1       2   6   linda
2       3   7     sam

CodePudding user response：

Since your goal is just to compare differences, use DataFrame.compare instead of aggregating into strings.

However, note that DataFrame.compare

can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames

So we just need to align the row/column indexes, either via merge or reindex.

Align via `merge`

Outer-merge the two dfs:

merged = df1.merge(df2, how='outer', left_on='col_id', right_on='id')
#    col_id   num  name_x  id  no  name_y
# 0       1     3   linda   1   2  granpa
# 1       2     4   james   2   6   linda
# 2    <NA>  <NA>     NaN   3   7     sam

Divide the merged frame into left/right frames and align their columns with set_axis:

cols = df1.columns
left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
#    col_id   num   name
# 0       1     3  linda
# 1       2     4  james
# 2    <NA>  <NA>    NaN

right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)
#    col_id  num    name
# 0       1    2  granpa
# 1       2    6   linda
# 2       3    7     sam

compare the aligned left/right frames (use keep_equal=True to show equal cells):

left.compare(right, keep_shape=True, keep_equal=True)
#        col_id         num          name
#    self other  self other   self  other
# 0     1     1     3     2  linda granpa
# 1     2     2     4     6  james  linda
# 2  <NA>     3  <NA>     7    NaN    sam

left.compare(right, keep_shape=True)
#        col_id         num          name
#    self other  self other   self  other
# 0  <NA>  <NA>     3     2  linda granpa
# 1  <NA>  <NA>     4     6  james  linda
# 2  <NA>  <NA>  <NA>  <NA>    NaN    sam

Align via `reindex`

If you are 100% sure that one df is a subset of the other, then reindex the subsetted rows.

In your example, df1 is a subset of df2, so reindex df1:

df1.assign(id=df1.col_id)          # copy col_id (we need original col_id after reindexing)
   .set_index('id')                # set index to copied id
   .reindex(df2.id)                # reindex on df2's id
   .reset_index(drop=True)         # remove copied id
   .set_axis(df2.columns, axis=1)  # align column names
   .compare(df2, keep_equal=True, keep_shape=True)

#        col_id         num          name
#    self other  self other   self  other
# 0     1     1     3     2  linda granpa
# 1     2     2     4     6  james  linda
# 2  <NA>     3  <NA>     7    NaN    sam

Nullable integers

Normally int cannot mix with nan, so pandas converts to float. If you want the int values to stay int (like the examples above):

either convert the int columns to nullable integers with astype('Int64') (capital I)
or convert them to astype(object) (not recommended)

Align via merge

Align via reindex

Nullable integers

Align via `merge`

Align via `reindex`