Every month I collect data that contains details of employees to be stored in our database. I need to find a solution to compare the data stored in the previous month to the data received and, for each row that any of the columns had a change, it would return into a new dataframe.
I would also need to know somehow which columns in each row of this new returned dataframe had a change when this comparison happened.
There are also some important details to mention:
- Each column can also contain blank values in any of the dataframes;
- The dataframes have the same column names but not necessarily the same data type;
- The dataframes do not have the same number of rows necessarily;
- If a row do not find its Index match, do not return to the new dataframe;
- The rows of the dataframes can be matched by a column named "Index"
So, for example, we would have this dataframe (which is just a slice of the real one as it has 63 columns):
df1:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack [email protected] 01-01-2021
2 HR 7000 O'Donnel [email protected]
3 MKT $7600 Maria d 30-06-2021
4 I'T 8000 Peter [email protected] 14-07-2021
df2:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack [email protected] 01-01-2021
2 HR 7000 O'Donnel [email protected] 01-01-2021
3 MKT 7600 Maria [email protected] 30-06-2021
4 IT 8000 Peter [email protected] 14-07-2021
5 IT 9000 John NOT PROVIDED
6 IT 9900 John NOT PROVIDED
df3:
Index Department Salary Manager Email Start_Date
2 HR 7000 O'Donnel [email protected] 01-01-2021
3 MKT 7600 Maria [email protected] 30-06-2021
4 IT 8000 Peter [email protected] 14-07-2021
**The differences in this example are:
- Start date added in row of Index 2
- Salary format corrected and email corrected for row Index 3
- Department format corrected for row Index 4
What would be the best way to to this comparison? I am not sure if there is an easy solution to understand what changed in each field but returning the dataframe with rows that had at least 1 change would be helpful.
Thank you for the support!
CodePudding user response:
I think compare could do the trick: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html
But first you would need to align the rows between old and new dataframe via the index:
new_df_to_compare=new_df.loc[old_df.index]
When datatypes don't match. You would also need to align them:
new_df_to_compare = new_df_to_compare.astype(old_df.dtypes.to_dict())
Then compare should work just like this:
difference_df = old_df.compare(new_df_to_compare)
CodePudding user response:
There is a rather compact solution:
df1.iloc[df2.index[df2.index.isin(df1.index)]].compare(df2)
However, maybe you wish a different output.
df1.iloc[df1.iloc[df2.index[df2.index.isin(df1.index)]].compare(df2).index]
This gives exactly what you want.