I have 2 data sources - config and input data
Config
------------ -------
| ContractID | PB ID |
------------ -------
| H9500 | 002 |
------------ -------
| H9500 | 008 |
------------ -------
| H3544 | 800 |
------------ -------
| H3544 | 801 |
------------ -------
| H3544 | 802 |
------------ -------
Input file
------- ------------ -------
| Index | ContractID | PB ID |
------- ------------ -------
| 1 | H9500 | 456 |
------- ------------ -------
| 2 | H9500 | 008 |
------- ------------ -------
| 3 | H9500 | 002 |
------- ------------ -------
| 4 | H3544 | 853 |
------- ------------ -------
| 5 | H3544 | 802 |
------- ------------ -------
| 6 | H4599 | 465 |
------- ------------ -------
I'm trying to loop through the input file and check if the Contract ID is valid or not. If a Contract ID = H9500
and PB ID= 456
is Invalid because that combination is not present in the config file. Contract ID = H9500
and PB ID= 008
is valid because that combination is present in config file.
I'm using the below logic
input_df=pd.DataFrame({'Index': [1,2,3,4,5,6],
'ContractID' : ['H9500','H9500','H9500','H3544','H3544','H4599'],
'PBID': ['456','008','002','853','802','465']})
config_df=pd.DataFrame({'ContractID':['H9500','H9500','H3544','H3544','H3544'],
'PBID':['002','008','800','801','802']})
config_dict={k: list(v) for k,v in config_df.groupby("ContractID")["PBID"]}
def test_6_3(s):
if config_dict.get(s["ContractID"]) and s["PBID"] not in config_dict.values():
return "Invalid PB Contract"
else:
return "Valid"
input_df['test_6_3'] = input_df.apply(test_6_3, axis=1)
input_df
But I'm not getting the expected result
------- ------------ ------- ----------
| Index | ContractID | PB ID | test_6_3 |
------- ------------ ------- ----------
| 1 | H9500 | 456 | Invalid |
------- ------------ ------- ----------
| 2 | H9500 | 008 | Valid |
------- ------------ ------- ----------
| 3 | H9500 | 002 | Valid |
------- ------------ ------- ----------
| 4 | H3544 | 853 | Invalid |
------- ------------ ------- ----------
| 5 | H3544 | 802 | Valid |
------- ------------ ------- ----------
| 6 | H4599 | 465 | Invalid |
------- ------------ ------- ----------
CodePudding user response:
UPDATED (based on corrected input by OP):
This should do what your question asks:
df = input_df.join(config_df.assign(test_6_3='Valid').set_index(['ContractID', 'PBID']), on = ['ContractID', 'PBID']).fillna('Invalid')
Input:
Index ContractID PBID
0 1 H9500 456
1 2 H9500 008
2 3 H9500 002
3 4 H3544 853
4 5 H3544 802
5 6 H4599 465
ContractID PBID
0 H9500 002
1 H9500 008
2 H3544 800
3 H3544 801
4 H3544 802
Output:
Index ContractID PBID test_6_3
0 1 H9500 456 Invalid
1 2 H9500 008 Valid
2 3 H9500 002 Valid
3 4 H3544 853 Invalid
4 5 H3544 802 Valid
5 6 H4599 465 Invalid
CodePudding user response:
names = ['ContractID', 'PBID']
config = dfc[names].apply(tuple, axis=1).to_list()
config
Output:
[('H9500', 2), ('H9500', 8), ('H3544', 800), ('H3544', 801), ('H3544', 802)]
code:
(
df.assign(test_6_3=df[names].apply(lambda x: tuple(x) in config, axis=1))
.assign(test_6_3=lambda x: x.test_6_3.map({True: 'Valid', False: 'Invalid'}))
)
Output:
Index ContractID PBID test_6_3
0 1 H9500 456 Invalid
1 2 H9500 8 Valid
2 3 H9500 2 Valid
3 4 H3544 853 Invalid
4 5 H3544 802 Valid
5 6 H4599 465 Invalid