Home > front end >  python pandas - compare two dataframes in multiple ways by custom ID
python pandas - compare two dataframes in multiple ways by custom ID

Time:01-05

I need to check what has changed over time in two datasets with different date:

Prior date:

Date      ID        Value      Category  Subcategory
30-Nov    0001      100.00     A         A100
30-Nov    0002      200.00     B         B120
30-Nov    0003      300.00     C         C300
30-Nov    0004      450.00     D         D900
30-Nov    0005      500.00     D         D900

Current date:

Date      ID        Value      Category  Subcategory
31-Dec    0001      100.00     A         A100
31-Dec    0002      200.00     B         B101
31-Dec    0003      300.00     C         C300
31-Dec    0004      400.00     E         E900
31-Dec    0006      600.00     D         D900

Now I need to create 4 dataframes:

  1. Changes in Value:
Date      ID        Value      Category  Subcategory Prior Value
31-Dec    0004      400.00     E         E900        450.00
  1. Changes in Category:
Date      ID        Value      Category  Subcategory Prior Category
31-Dec    0004      400.00     E         E900        D
  1. Changes in Subcategory, but only if category didn't change:
Date      ID        Value      Category  Subcategory Prior Subcategory
31-Dec    0002      200.00     B         B101        B120
  1. Items population change:
Date      ID        Value      Category  Subcategory
31-Dec    0006      600.00     D         D900
30-Nov    0005      500.00     D         D900

I think I should first run the population check and exclude those breaks, so I will have only two datasets with identical ID sets. I will follow the examples from here: Comparing two dataframes and getting the differences

For comparing 1to1 values I found a numpy piece of code but it compares it by default index, not by ID, how to do it using my ID column as record identifier? This is going to be a large dataset and I can't base it on default index.

value_df = current_df
value_df['prior value'] = np.where(prior_df['Value'] == current_df['Value'], 'Match', prior_df['Value'])
value_df = value_df[value_df['prior value'] != 'Match']

For multiple conditions do I have to filter it out step by step (first filter out category change, then filter subcategory change) or can I use AND to concatenate the conditions?

below is the code for creating dataframes:

prior_data = {'Date': ['30-Nov','30-Nov','30-Nov','30-Nov', '30-Nov'],
          'ID': ['0001','0002','0003','0004', '0005'],
          'Value' : [100.00, 200.00, 300.00, 450.00, 500.00],
          'Category' : ['A','B','C','D','D'],
          'Subcategory' : ['A100','B120','C300','D900','D900']}


current_data = {'Date': ['31-Dec','31-Dec','31-Dec','31-Dec','31-Dec'],
          'ID': ['0001','0002','0003','0004', '0006'],
          'Value' : [100.00, 200.00, 300.00, 400.00, 600.00],
          'Category' : ['A','B','C','E','D'],
          'Subcategory' : ['A100','B101','C300','E900','D900']}

prior_df = pd.DataFrame(prior_data)
current_df = pd.DataFrame(current_data)

CodePudding user response:

I am not sure if it is the fastest possible solution, but this problem seems to call for pd.merge. As you say, let's first deal with things that are in one dataframe but not the other:

def get_only_left(df1, df2):
    left_merge = pd.merge(df1, df2, on='ID', suffixes=('', '_other'), how='left')
    added_columns = [c   '_other' for c in df1.columns if c != 'ID']
    mask = left_merge.loc[:, added_columns].isna().all(axis=1)
    return left_merge[mask].drop(added_columns, axis=1)

pd.concat([get_only_left(prior_df, current_df), get_only_left(current_df, prior_df)])

This gives

     Date    ID  Value Category Subcategory
4  30-Nov  0005  500.0        D        D900
4  31-Dec  0006  600.0        D        D900

Then, let's deal with properly changing values.

columns = list(current_df.columns)
df = pd.merge(current_df, prior_df, on='ID', suffixes=('', '_prior'), how='inner')
mask = df['Value'] != df['Value_prior']
df[mask].loc[:, columns   ['Value_prior']]

This gives

     Date    ID  Value Category Subcategory  Value_prior
3  31-Dec  0004  400.0        E        E900        450.0

Then similarly:

mask = df['Category'] != df['Category_prior']
df[mask].loc[:, columns   ['Category_prior']]

gives

     Date    ID  Value Category Subcategory Category_prior
3  31-Dec  0004  400.0        E        E900              D

And finally

import numpy as np
mask = np.logical_and(df['Category'] == df['Category_prior'], df['Subcategory'] != df['Subcategory_prior'])
df[mask].loc[:, columns   ['Subcategory_prior']]

gives

     Date    ID  Value Category Subcategory Subcategory_prior
1  31-Dec  0002  200.0        B        B101              B120
  •  Tags:  
  • Related