Pandas: Search input data against values in dictionary-CodePudding

I have 2 data sources - config and input data

Config

 ------------ ------- 
| ContractID | PB ID |
 ------------ ------- 
| H9500      | 002   |
 ------------ ------- 
| H9500      | 008   |
 ------------ ------- 
| H3544      | 800   |
 ------------ ------- 
| H3544      | 801   |
 ------------ ------- 
| H3544      | 802   |
 ------------ -------

Input file

 ------- ------------ ------- 
| Index | ContractID | PB ID |
 ------- ------------ ------- 
| 1     | H9500      | 456   |
 ------- ------------ ------- 
| 2     | H9500      | 008   |
 ------- ------------ ------- 
| 3     | H9500      | 002   |
 ------- ------------ ------- 
| 4     | H3544      | 853   |
 ------- ------------ ------- 
| 5     | H3544      | 802   |
 ------- ------------ ------- 
| 6     | H4599      | 465   |
 ------- ------------ -------

I'm trying to loop through the input file and check if the Contract ID is valid or not. If a Contract ID = H9500 and PB ID= 456 is Invalid because that combination is not present in the config file. Contract ID = H9500 and PB ID= 008 is valid because that combination is present in config file.

I'm using the below logic

input_df=pd.DataFrame({'Index': [1,2,3,4,5,6], 
                      'ContractID' : ['H9500','H9500','H9500','H3544','H3544','H4599'],
                       'PBID': ['456','008','002','853','802','465']})

config_df=pd.DataFrame({'ContractID':['H9500','H9500','H3544','H3544','H3544'],
                        'PBID':['002','008','800','801','802']})

config_dict={k: list(v) for k,v in config_df.groupby("ContractID")["PBID"]}

def test_6_3(s):
    if config_dict.get(s["ContractID"]) and s["PBID"] not in config_dict.values():
        return "Invalid PB Contract"
    else:
        return "Valid"
    
input_df['test_6_3'] = input_df.apply(test_6_3, axis=1)

input_df

But I'm not getting the expected result

 ------- ------------ ------- ---------- 
| Index | ContractID | PB ID | test_6_3 |
 ------- ------------ ------- ---------- 
| 1     | H9500      | 456   | Invalid  |
 ------- ------------ ------- ---------- 
| 2     | H9500      | 008   | Valid    |
 ------- ------------ ------- ---------- 
| 3     | H9500      | 002   | Valid    |
 ------- ------------ ------- ---------- 
| 4     | H3544      | 853   | Invalid  |
 ------- ------------ ------- ---------- 
| 5     | H3544      | 802   | Valid    |
 ------- ------------ ------- ---------- 
| 6     | H4599      | 465   | Invalid  |
 ------- ------------ ------- ----------

CodePudding user response：

UPDATED (based on corrected input by OP):

This should do what your question asks:

df = input_df.join(config_df.assign(test_6_3='Valid').set_index(['ContractID', 'PBID']), on = ['ContractID', 'PBID']).fillna('Invalid')

Input:

   Index ContractID PBID
0      1      H9500  456
1      2      H9500  008
2      3      H9500  002
3      4      H3544  853
4      5      H3544  802
5      6      H4599  465
  ContractID PBID
0      H9500  002
1      H9500  008
2      H3544  800
3      H3544  801
4      H3544  802

Output:

   Index ContractID PBID test_6_3
0      1      H9500  456  Invalid
1      2      H9500  008    Valid
2      3      H9500  002    Valid
3      4      H3544  853  Invalid
4      5      H3544  802    Valid
5      6      H4599  465  Invalid

CodePudding user response：

names = ['ContractID', 'PBID']
config = dfc[names].apply(tuple, axis=1).to_list()
config

Output:

[('H9500', 2), ('H9500', 8), ('H3544', 800), ('H3544', 801), ('H3544', 802)]

code:

(
    df.assign(test_6_3=df[names].apply(lambda x: tuple(x) in config, axis=1))
    .assign(test_6_3=lambda x: x.test_6_3.map({True: 'Valid', False: 'Invalid'}))
)

Output:

   Index ContractID  PBID test_6_3
0      1      H9500   456  Invalid
1      2      H9500     8    Valid
2      3      H9500     2    Valid
3      4      H3544   853  Invalid
4      5      H3544   802    Valid
5      6      H4599   465  Invalid