Home > Software design >  Compare different df's row by row and return changes
Compare different df's row by row and return changes

Time:11-03

Every month I collect data that contains details of employees to be stored in our database. I need to find a solution to compare the data stored in the previous month to the data received and, for each row that any of the columns had a change, it would return into a new dataframe.

I would also need to know somehow which columns in each row of this new returned dataframe had a change when this comparison happened.

There are also some important details to mention:

  • Each column can also contain blank values in any of the dataframes;
  • The dataframes have the same column names but not necessarily the same data type;
  • The dataframes do not have the same number of rows necessarily;
  • If a row do not find its Index match, do not return to the new dataframe;
  • The rows of the dataframes can be matched by a column named "Index"

So, for example, we would have this dataframe (which is just a slice of the real one as it has 63 columns):

df1:
Index      Department      Salary      Manager      Email      Start_Date
1          IT              6000.00     Jack         [email protected]   01-01-2021
2          HR              7000        O'Donnel     [email protected]   
3          MKT             $7600       Maria        d          30-06-2021
4          I'T             8000        Peter        [email protected]   14-07-2021


df2:
Index      Department      Salary      Manager      Email      Start_Date
1          IT              6000.00     Jack         [email protected]   01-01-2021
2          HR              7000        O'Donnel     [email protected]   01-01-2021
3          MKT             7600        Maria        [email protected]   30-06-2021
4          IT              8000        Peter        [email protected]   14-07-2021
5          IT              9000        John                    NOT PROVIDED
6          IT              9900        John                    NOT PROVIDED

df3:
Index      Department      Salary      Manager      Email      Start_Date
2          HR              7000        O'Donnel     [email protected]   01-01-2021
3          MKT             7600        Maria        [email protected]   30-06-2021
4          IT              8000        Peter        [email protected]   14-07-2021

**The differences in this example are:

  • Start date added in row of Index 2
  • Salary format corrected and email corrected for row Index 3
  • Department format corrected for row Index 4

What would be the best way to to this comparison? I am not sure if there is an easy solution to understand what changed in each field but returning the dataframe with rows that had at least 1 change would be helpful.

Thank you for the support!

CodePudding user response:

I think compare could do the trick: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html

But first you would need to align the rows between old and new dataframe via the index:

new_df_to_compare=new_df.loc[old_df.index]

When datatypes don't match. You would also need to align them:

new_df_to_compare = new_df_to_compare.astype(old_df.dtypes.to_dict())

Then compare should work just like this:

difference_df = old_df.compare(new_df_to_compare)

CodePudding user response:

There is a rather compact solution:

df1.iloc[df2.index[df2.index.isin(df1.index)]].compare(df2)

However, maybe you wish a different output.

df1.iloc[df1.iloc[df2.index[df2.index.isin(df1.index)]].compare(df2).index]

This gives exactly what you want.

  • Related