Home > Net >  Version control with DataFrames
Version control with DataFrames

Time:02-01

I'm trying to compare two dataframes in order to check what have changed between both of them. This is part of a version control script so I've made a simplified version trying to find a solution:

data = {'ID':  ['1', '2', '3', '4'],
        'Date':  ['23-01-2023', '01-12-1995', '03-07-2013', '05-09-2013'],
        'Time':  ['01:45:08', '02:15:21', '23:57:14', '03:57:15'],
        'Path':  ['//server/test/File1.txt', '//server/test/File2.txt', '//server/test/File3.txt', '//server/test/File4.txt'],
        }
data2 = {'ID':  ['1', '2', '3'],

        'Date':  ['23-01-2023', '03-07-2013', '01-12-1995', '05-09-2013'],
        'Time':  ['01:45:08', '23:57:14', '02:17:21', '03:18:31'],
        'Path':  ['//server/test/File1.txt', '//server/test/File3.txt', '//server/test/File2.txt', '//server/test/File5.txt'],
        }

df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)

So I've the 2 dataframes created as follows:

DataFrame 1

 | ID | Date       | Time      | Path                       |
 |  1 | 23-01-2023 | 01:45:08  | //server/test/File1.txt    |
 |  2 | 01-12-1995 | 02:15:21  | //server/test/File2.txt    |
 |  3 | 03-07-2013 | 23:57:14  | //server/test/File3.txt    |
 |  4 | 05-09-2013 | 03:57:15  | //server/test/File4.txt    |

DataFrame 2

 | ID | Date       | Time      | Path                       |
 |  1 | 23-01-2023 | 01:45:08  | //server/test/File1.txt    |
 |  2 | 03-07-2013 | 23:57:14  | //server/test/File3.txt    |
 |  3 | 01-12-1995 | 02:17:21  | //server/test/File2.txt    |
 |  4 | 21-11-1991 | 03:18:31  | //server/test/File5.txt    |

Taking as reference the first one I know:

  1. File with ID 4 has been removed
  2. File 2 have been modified
  3. New file has been added (ID 4 in table dataframe 2)

At the end I would like to have the following output :

 | ID | Date       | Time      | Path                       |  Status  |
 |  1 | 23-01-2023 | 01:45:08  | //server/test/File1.txt    |     -    |
 |  2 | 01-12-1995 | 02:15:21  | //server/test/File2.txt    |  UPDATED |
 |  3 | 03-07-2013 | 23:57:14  | //server/test/File3.txt    |     -    |
 |  4 | 05-09-2013 | 03:57:15  | //server/test/File4.txt    |  DELETED |
 |  5 | 21-11-1991 | 03:18:31  | //server/test/File5.txt    |   ADDED  |

Can that be done using just JOINs of Pandas ?

CodePudding user response:

If your intention is to compare the path, move it to the index and use concat:

compare = pd.concat([
    df.drop(columns="ID").set_index("Path"),
    df2.drop(columns="ID").set_index("Path")
], keys=["Old", "New"], axis=1)

# I assume `Date` is non-nullable. You can pick any non-nullable
# column to use for the ADDED / DELETED check
compare["Status"] = np.select(
    [
        (compare["Old"] == compare["New"]).all(axis=1),
        compare[("Old", "Date")].isnull(),
        compare[("New", "Date")].isnull(),
    ],
    [
        "-",
        "ADDED",
        "DELETED",
    ],
    "UPDATED",
)

CodePudding user response:

Solution which includes:

  • merging dataframes on Path key with crucial indicator param to get a marker denoting which side (or both) succeeded in merge
  • map Status field to predefined mapping {'left_only': 'DELETED', 'right_only': 'ADDED'}
  • find UPDATED records by basic comparing Date/Time between sides
  • adjusting column names and empty fields

df_ = df.merge(df2, on='Path', how='outer', indicator=True)
status_map = {'left_only': 'DELETED', 'right_only': 'ADDED'}
df_['Status'] = df_['_merge'].map(status_map)
df_['Status'] = np.where(df_['_merge'] == 'both',
                     np.where((df_['Date_y'] != df_['Date_x'])
                              | (df_['Time_y'] != df_['Time_x']),
                              'UPDATED', '-'), df_['Status'])
df_.loc[df_['Date_x'].isnull(), ['Date_x', 'Time_x']] = df_[df_['Date_x'].isnull()][['Date_y', 'Time_y']].values
df_.drop(df_.filter(regex='.*(_y|_merge)$').columns, axis=1, inplace=True)
df_.columns = df_.columns.str.replace(r'_x$', '', regex=True)
df_.loc[df_['ID'].isnull(), 'ID'] = range(int(df_['ID'].max())   1, df_.index.size   1)
print(df_)

    ID        Date      Time                     Path   Status
0  1.0  23-01-2023  01:45:08  //server/test/File1.txt        -
1  2.0  01-12-1995  02:15:21  //server/test/File2.txt  UPDATED
2  3.0  03-07-2013  23:57:14  //server/test/File3.txt        -
3  4.0  05-09-2013  03:57:15  //server/test/File4.txt  DELETED
4  5.0  21-11-1991  03:18:31  //server/test/File5.txt    ADDED
  • Related