Home > Enterprise >  How to compare 2 Dataframes on set columns using Groupby where the data is not organized ideally in
How to compare 2 Dataframes on set columns using Groupby where the data is not organized ideally in

Time:11-13

The problem that I am attempting to solve is to reconcile rates applied to the actual accounts shown in "AccountTable" versus what the should be set in "RateTable". For each Account its rate can be set at different levels, either at the Account level or at the Parent level. More than one account can be linked to the same parent but it would differ by currency for example. While I can get it to compare, my solution involves a lot of code which is repeated and its not scalable where this example only looks at 2 different grouping where I could have up to 9 different grouping combo's to compare.

Here is sample AccountTable:

import pandas as pd
import numpy as np
AccountTable = pd.DataFrame([[1234567890,456,'EUR',3.5],
                    [7854567890,15,'USD',2.7],
                    [9632587415,56,'GBP',1.4]],
columns = ['Account','ParentID','Cur','Rate'])
AccountTable

Output:

Account ParentID    Cur Rate
0   1234567890  456 EUR 3.5
1   7854567890  15  USD 2.7
2   9632587415  56  GBP 1.4

Here is the RateTable:

RateTable = pd.DataFrame([['Account',1234567890,'EUR',3.5], # Rate set at account level and shuold return a match
                    ['ParentID',456,'EUR',3.5], # should be Unused as match found at account level
                    ['ParentID',15,'USD',2.7],# rate set at account level and matches 
                    ['ParentID',15,'CAD',1.5],# CAD not in Account Table therfore unused 
                    ['Account',9876542190,'EUR',3.5], # Account Table therfore unused  
                    ['ParentID',56,'GBP',1.5]], # rate set on parent level but rates don't match so return should be mismatch here
columns = ['Level_Type','ID','Cur','Set_Rate'])

Output:

Level_Type  ID          Cur Set_Rate
0   Account 1234567890  EUR 3.5
1   ParentID 456        EUR 3.5
2   ParentID 15         USD 2.7
3   ParentID 15         CAD 1.5
4   Account 9876542190  EUR 3.5
5   ParentID 56         GBP 1.5

My solutions is as follows, where I split the RateTable into multiply Dataframes based on the different level. In this case 2 - Account level and Parent Level. Then I left join them independently to the AccountTable using the Groupby function and compare the rates.

option1 = ['Account']
option2 = ['ParentID']
AccountView = RateTable[RateTable['Level_Type'].isin(option1)]
ParentView = RateTable[RateTable['Level_Type'].isin(option2)]
AccountView = AccountView.rename(columns={'Set_Rate':'Account_Set_Rate'})
ParentView = ParentView.rename(columns={'Set_Rate':'Parent_Set_Rate'})
AccountView = AccountView.rename(columns={'ID':'Account_ID'})
ParentView = ParentView.rename(columns={'ID':'Parent_ID'})
# new view to identify matches at Account level Only 
df = pd.merge(AccountTable, AccountView, left_on=['Account','Cur'], right_on=['Account_ID','Cur'], how='left')
df['Account_level_RateMatch'] = np.where(df['Rate'] == df['Account_Set_Rate'],'1','0').astype(int) 
Account ParentID    Cur Rate    Level_Type  Account_ID  Account_Set_Rate    Account_level_RateMatch
0   1234567890  456 EUR 3.5     Account     1.234568e 09    3.5             1
1   7854567890  15  USD 2.7     NaN         NaN             NaN             0
2   9632587415  56  GBP 1.4     NaN         NaN             NaN             0

The above is repeated but matching on the parent level now:

df = pd.merge(AccountTable, ParentView, left_on=['ParentID','Cur'], right_on=['Parent_ID','Cur'], how='left')
df['Parent_level_RateMatch'] = np.where(df['Rate'] == df['Parent_Set_Rate'],'1','0').astype(int) # compare rates 

Output:

Account ParentID    Cur Rate    Level_Type  Parent_ID   Parent_Set_Rate Parent_level_RateMatch
0   1234567890  456 EUR 3.5     ParentID    456         3.5             1
1   7854567890  15  USD 2.7     ParentID    15          2.7             1
2   9632587415  56  GBP 1.4     ParentID    56          1.5             0

I need to have a better way to compare the Account Rates to the Rate table rather than doing separate views. Also, the logic needs to be if a match is found at the first level "Account level" its stops there and doesn't need to check the next level i.e. parent level such as in # row 1 it matches at both account and parent level.

Any thoughts or solutions would be greatly appreciated.

Desired Output:

Account ParentID    Cur Rate    IsMatch LevelFound
0   1234567890  456 EUR 3.5     1       Account
1   7854567890  15  USD 2.7     1       Parent
2   9632587415  56  GBP 1.4     0       Parent

CodePudding user response:

EDIT solution, similar to original one but to fit better OP's expected output

#define the order of the levels
ordered_levels = ['Account','ParentID'] 

# fnd all the matching rates
res = (
    pd.concat(
        [AccountTable
           .merge(RateTable.loc[RateTable['Level_Type'].eq(lvl),#row with good level
                                 # columns for comparision with AccountTable
                                ['ID','Cur','Set_Rate']]
                           .rename(columns={'ID':lvl, 'Cur':f'Cur_opt'}), 
                 on=lvl, how='inner')
           .query('Cur == Cur_opt') #EDIT to query same cur
           .assign(LevelFound=lvl, 
                   #EDIT if rate not the same then 0
                   Is_Match=lambda x: x['Rate'].eq(x['Set_Rate']).astype(int)) 
         for lvl in ordered_levels]) # do the merge operation on each level
    #EDIT for selecting first 1 if any, then first 0
    .sort_values('Is_Match', ascending=False) 
    # keep the first matched per initial AccountTable or higher level non-match
    .drop_duplicates(ordered_levels) 
    [AccountTable.columns.tolist()   ['LevelFound','Is_Match']] 
)
print(res)
#       Account  ParentID  Cur  Rate LevelFound  Is_Match
# 0  1234567890       456  EUR   3.5    Account         1
# 1  7854567890        15  USD   2.7   ParentID         1
# 3  9632587415        56  GBP   1.4   ParentID         0

Original solution

Here is a solution, you need first to define the order of the levels, then you can loop over each one, select the rows wanted in RateTable, then merge with Account, and keep only the matched cur and rate (query). concat all the matched data and keep only the first match per AccountTable initial row.

#define the order of the levels
ordered_levels = ['Account','ParentID'] 

# fnd all the matching rates
matched = (
    pd.concat(
        [AccountTable
           .merge(RateTable.loc[RateTable['Level_Type'].eq(lvl),#row with good level
                                # columns for comparision with AccountTable
                                ['ID','Cur','Set_Rate']] 
                           .rename(columns={'ID':lvl, 'Cur':f'Cur_opt'}), 
                 on=lvl, how='inner')
           # keep only the matching data
           .query('Cur == Cur_opt and Rate == Set_Rate')
           # add the two columns for the ouput
           .assign(LevelFound=opt, Is_Match=1)
         for lvl in ordered_levels]) # do the merge operation on each level
    .drop_duplicates(ordered_levels) # keep the first matched per initial AccountTable
    [AccountTable.columns.tolist()   ['LevelFound','Is_Match']] 
)
print(matched) # note that the row wihtout match is missing
#       Account  ParentID  Cur  Rate LevelFound  Is_Match
# 0  1234567890       456  EUR   3.5    Account         1
# 1  7854567890        15  USD   2.7   ParentID         1 

If you want to add the rows with no match, then you can do

res = AccountTable.merge(matched, how='left')
print(res)
#       Account  ParentID  Cur  Rate LevelFound  Is_Match
# 0  1234567890       456  EUR   3.5    Account       1.0
# 1  7854567890        15  USD   2.7   ParentID       1.0
# 2  9632587415        56  GBP   1.4        NaN       NaN
  • Related