I'm trying to compare two dataframes in order to check what has changed between them. This is part of a version control script, so I've made a simplified version to try to find a solution:
import pandas as pd

data = {'ID': ['1', '2', '3', '4'],
        'Date': ['23-01-2023', '01-12-1995', '03-07-2013', '05-09-2013'],
        'Time': ['01:45:08', '02:15:21', '23:57:14', '03:57:15'],
        'Path': ['//server/test/File1.txt', '//server/test/File2.txt', '//server/test/File3.txt', '//server/test/File4.txt'],
        }
data2 = {'ID': ['1', '2', '3', '4'],
         'Date': ['23-01-2023', '03-07-2013', '01-12-1995', '21-11-1991'],
         'Time': ['01:45:08', '23:57:14', '02:17:21', '03:18:31'],
         'Path': ['//server/test/File1.txt', '//server/test/File3.txt', '//server/test/File2.txt', '//server/test/File5.txt'],
         }

df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
So I have the two dataframes created as follows:
DataFrame 1
| ID | Date | Time | Path |
| --- | --- | --- | --- |
| 1 | 23-01-2023 | 01:45:08 | //server/test/File1.txt |
| 2 | 01-12-1995 | 02:15:21 | //server/test/File2.txt |
| 3 | 03-07-2013 | 23:57:14 | //server/test/File3.txt |
| 4 | 05-09-2013 | 03:57:15 | //server/test/File4.txt |
DataFrame 2
| ID | Date | Time | Path |
| --- | --- | --- | --- |
| 1 | 23-01-2023 | 01:45:08 | //server/test/File1.txt |
| 2 | 03-07-2013 | 23:57:14 | //server/test/File3.txt |
| 3 | 01-12-1995 | 02:17:21 | //server/test/File2.txt |
| 4 | 21-11-1991 | 03:18:31 | //server/test/File5.txt |
Taking the first one as reference, I know:
- The file with ID 4 has been removed
- File 2 has been modified
- A new file has been added (ID 4 in the DataFrame 2 table)
In the end I would like to have the following output:
| ID | Date | Time | Path | Status |
| --- | --- | --- | --- | --- |
| 1 | 23-01-2023 | 01:45:08 | //server/test/File1.txt | - |
| 2 | 01-12-1995 | 02:15:21 | //server/test/File2.txt | UPDATED |
| 3 | 03-07-2013 | 23:57:14 | //server/test/File3.txt | - |
| 4 | 05-09-2013 | 03:57:15 | //server/test/File4.txt | DELETED |
| 5 | 21-11-1991 | 03:18:31 | //server/test/File5.txt | ADDED |
Can that be done using just pandas JOINs?
CodePudding user response:
If your intention is to compare on the Path, move it to the index and use concat:
import numpy as np

# Concatenate both frames side by side, keyed by Path, so the columns
# become a two-level index: ("Old", Date/Time) and ("New", Date/Time)
compare = pd.concat([
    df.drop(columns="ID").set_index("Path"),
    df2.drop(columns="ID").set_index("Path")
], keys=["Old", "New"], axis=1)

# I assume `Date` is non-nullable. You can pick any non-nullable
# column to use for the ADDED / DELETED check
compare["Status"] = np.select(
    [
        (compare["Old"] == compare["New"]).all(axis=1),
        compare[("Old", "Date")].isnull(),
        compare[("New", "Date")].isnull(),
    ],
    [
        "-",
        "ADDED",
        "DELETED",
    ],
    "UPDATED",
)
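If you then want a flat table like the one in the question (one row per file plus the Status column), a possible follow-up, purely a sketch of mine and not part of the answer above, is to prefer the New values and fall back to Old for deleted rows. The ID column was dropped, so it is not reconstructed here, and the names flat / result are just placeholders:

# Sketch: collapse the Old/New column levels back into a single table
flat = compare["New"].combine_first(compare["Old"])    # Date/Time per Path, New first, Old as fallback
flat["Status"] = compare["Status"].to_numpy().ravel()  # flatten Status whether it comes back as Series or 1-column frame
result = flat.reset_index()[["Date", "Time", "Path", "Status"]]
print(result)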
CodePudding user response:
Solution which includes:
- merging the dataframes on the Path key with the crucial indicator param, to get a marker denoting which side (or both) each row came from in the merge
- mapping the Status field to the predefined mapping {'left_only': 'DELETED', 'right_only': 'ADDED'}
- finding UPDATED records by comparing Date/Time between both sides
- adjusting column names and empty fields
import numpy as np

# outer merge on Path; indicator=True adds a '_merge' column telling
# whether each row came from the left, the right, or both frames
df_ = df.merge(df2, on='Path', how='outer', indicator=True)
status_map = {'left_only': 'DELETED', 'right_only': 'ADDED'}
df_['Status'] = df_['_merge'].map(status_map)

# rows present on both sides are UPDATED if Date or Time differ, otherwise '-'
df_['Status'] = np.where(df_['_merge'] == 'both',
                         np.where((df_['Date_y'] != df_['Date_x'])
                                  | (df_['Time_y'] != df_['Time_x']),
                                  'UPDATED', '-'), df_['Status'])

# fill Date/Time for added rows from the right-hand side, then drop the
# helper columns and strip the '_x' suffix
df_.loc[df_['Date_x'].isnull(), ['Date_x', 'Time_x']] = df_[df_['Date_x'].isnull()][['Date_y', 'Time_y']].values
df_.drop(df_.filter(regex='.*(_y|_merge)$').columns, axis=1, inplace=True)
df_.columns = df_.columns.str.replace(r'_x$', '', regex=True)

# give newly added rows the next free IDs
df_.loc[df_['ID'].isnull(), 'ID'] = range(int(df_['ID'].max()) + 1, df_.index.size + 1)
print(df_)
ID Date Time Path Status
0 1.0 23-01-2023 01:45:08 //server/test/File1.txt -
1 2.0 01-12-1995 02:15:21 //server/test/File2.txt UPDATED
2 3.0 03-07-2013 23:57:14 //server/test/File3.txt -
3 4.0 05-09-2013 03:57:15 //server/test/File4.txt DELETED
4 5.0 21-11-1991 03:18:31 //server/test/File5.txt ADDED
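As a small optional touch, assuming you want the ID column shown as whole numbers like in the desired output (the outer merge turns it into floats because of the NaN introduced for the added row), you could cast it at the end:

df_['ID'] = df_['ID'].astype(int)  # assumption: every ID has been filled in by this point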