I need to check what has changed over time in two datasets with different date:
Prior date:
Date ID Value Category Subcategory
30-Nov 0001 100.00 A A100
30-Nov 0002 200.00 B B120
30-Nov 0003 300.00 C C300
30-Nov 0004 450.00 D D900
30-Nov 0005 500.00 D D900
Current date:
Date ID Value Category Subcategory
31-Dec 0001 100.00 A A100
31-Dec 0002 200.00 B B101
31-Dec 0003 300.00 C C300
31-Dec 0004 400.00 E E900
31-Dec 0006 600.00 D D900
Now I need to create 4 dataframes:
- Changes in Value:
Date ID Value Category Subcategory Prior Value
31-Dec 0004 400.00 E E900 450.00
- Changes in Category:
Date ID Value Category Subcategory Prior Category
31-Dec 0004 400.00 E E900 D
- Changes in Subcategory, but only if category didn't change:
Date ID Value Category Subcategory Prior Subcategory
31-Dec 0002 200.00 B B101 B120
- Items population change:
Date ID Value Category Subcategory
31-Dec 0006 600.00 D D900
30-Nov 0005 500.00 D D900
I think I should first run the population check and exclude those breaks, so I will have only two datasets with identical ID sets. I will follow the examples from here: Comparing two dataframes and getting the differences
For comparing 1to1 values I found a numpy piece of code but it compares it by default index, not by ID, how to do it using my ID column as record identifier? This is going to be a large dataset and I can't base it on default index.
value_df = current_df
value_df['prior value'] = np.where(prior_df['Value'] == current_df['Value'], 'Match', prior_df['Value'])
value_df = value_df[value_df['prior value'] != 'Match']
For multiple conditions do I have to filter it out step by step (first filter out category change, then filter subcategory change) or can I use AND to concatenate the conditions?
below is the code for creating dataframes:
prior_data = {'Date': ['30-Nov','30-Nov','30-Nov','30-Nov', '30-Nov'],
'ID': ['0001','0002','0003','0004', '0005'],
'Value' : [100.00, 200.00, 300.00, 450.00, 500.00],
'Category' : ['A','B','C','D','D'],
'Subcategory' : ['A100','B120','C300','D900','D900']}
current_data = {'Date': ['31-Dec','31-Dec','31-Dec','31-Dec','31-Dec'],
'ID': ['0001','0002','0003','0004', '0006'],
'Value' : [100.00, 200.00, 300.00, 400.00, 600.00],
'Category' : ['A','B','C','E','D'],
'Subcategory' : ['A100','B101','C300','E900','D900']}
prior_df = pd.DataFrame(prior_data)
current_df = pd.DataFrame(current_data)
CodePudding user response:
I am not sure if it is the fastest possible solution, but this problem seems to call for pd.merge
. As you say, let's first deal with things that are in one dataframe but not the other:
def get_only_left(df1, df2):
left_merge = pd.merge(df1, df2, on='ID', suffixes=('', '_other'), how='left')
added_columns = [c '_other' for c in df1.columns if c != 'ID']
mask = left_merge.loc[:, added_columns].isna().all(axis=1)
return left_merge[mask].drop(added_columns, axis=1)
pd.concat([get_only_left(prior_df, current_df), get_only_left(current_df, prior_df)])
This gives
Date ID Value Category Subcategory
4 30-Nov 0005 500.0 D D900
4 31-Dec 0006 600.0 D D900
Then, let's deal with properly changing values.
columns = list(current_df.columns)
df = pd.merge(current_df, prior_df, on='ID', suffixes=('', '_prior'), how='inner')
mask = df['Value'] != df['Value_prior']
df[mask].loc[:, columns ['Value_prior']]
This gives
Date ID Value Category Subcategory Value_prior
3 31-Dec 0004 400.0 E E900 450.0
Then similarly:
mask = df['Category'] != df['Category_prior']
df[mask].loc[:, columns ['Category_prior']]
gives
Date ID Value Category Subcategory Category_prior
3 31-Dec 0004 400.0 E E900 D
And finally
import numpy as np
mask = np.logical_and(df['Category'] == df['Category_prior'], df['Subcategory'] != df['Subcategory_prior'])
df[mask].loc[:, columns ['Subcategory_prior']]
gives
Date ID Value Category Subcategory Subcategory_prior
1 31-Dec 0002 200.0 B B101 B120