I have two dataframes with the same column order but different column names and different rows. df2
rows vary from df1
rows.
df1= col_id num name
0 1 3 linda
1 2 4 James
df2= id no name
0 1 2 granpa
1 2 6 linda
2 3 7 sam
This is the output I need. Outputs rows with same, OLD and NEW values so the user can clearly see what changed between two dataframes:
result col_id num name
0 1 was 3| now 2 was linda| now granpa
1 2 was 4| now 6 was James| now linda
2 was | now 3 was | now 7 was | now sam
CodePudding user response:
If I understand correctly, you want something like this:
new_df = df1.drop(['name', 'num'], axis=1).merge(df2.rename({'id': 'col_id'}, axis=1), how='outer')
Output:
>>> new_df
col_id no name
0 1 2 granpa
1 2 6 linda
2 3 7 sam
CodePudding user response:
Since your goal is just to compare differences, use DataFrame.compare
instead of aggregating into strings.
However, note that DataFrame.compare
can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames
So we just need to align the row/column indexes, either via merge
or reindex
.
Align via merge
Outer-
merge
the two dfs:merged = df1.merge(df2, how='outer', left_on='col_id', right_on='id') # col_id num name_x id no name_y # 0 1 3 linda 1 2 granpa # 1 2 4 james 2 6 linda # 2 <NA> <NA> NaN 3 7 sam
Divide the
merged
frame intoleft
/right
frames and align their columns withset_axis
:cols = df1.columns left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1) # col_id num name # 0 1 3 linda # 1 2 4 james # 2 <NA> <NA> NaN right = merged.iloc[:, len(cols):].set_axis(cols, axis=1) # col_id num name # 0 1 2 granpa # 1 2 6 linda # 2 3 7 sam
compare
the alignedleft
/right
frames (usekeep_equal=True
to show equal cells):left.compare(right, keep_shape=True, keep_equal=True) # col_id num name # self other self other self other # 0 1 1 3 2 linda granpa # 1 2 2 4 6 james linda # 2 <NA> 3 <NA> 7 NaN sam left.compare(right, keep_shape=True) # col_id num name # self other self other self other # 0 <NA> <NA> 3 2 linda granpa # 1 <NA> <NA> 4 6 james linda # 2 <NA> <NA> <NA> <NA> NaN sam
Align via reindex
If you are 100% sure that one df is a subset of the other, then reindex
the subsetted rows.
In your example, df1
is a subset of df2
, so reindex
df1
:
df1.assign(id=df1.col_id) # copy col_id (we need original col_id after reindexing)
.set_index('id') # set index to copied id
.reindex(df2.id) # reindex on df2's id
.reset_index(drop=True) # remove copied id
.set_axis(df2.columns, axis=1) # align column names
.compare(df2, keep_equal=True, keep_shape=True)
# col_id num name
# self other self other self other
# 0 1 1 3 2 linda granpa
# 1 2 2 4 6 james linda
# 2 <NA> 3 <NA> 7 NaN sam
Nullable integers
Normally int
cannot mix with nan
, so pandas converts to float
. If you want the int
values to stay int
(like the examples above):
- either convert the
int
columns to nullable integers withastype('Int64')
(capitalI
) - or convert them to
astype(object)
(not recommended)