I have gotten to a point with the following script and I wasn't able to take it all the way even with functions as the "is.in" possibly because I never used it prior to that.
Input df1:
ID Alternative ID
0 152503 009372
1 249774 249774
2 062005 196582
3 185704 185704
4 081231 081231
5 081231 062085
6 912568 222416
7 196782 195122
Input df2:
New_ID
0 498109
1 081231
2 231051
3 062005
4 152503
5 967272
6 875612
The idea is that I want to check whether the values of "ID" match with the ones on "Alternative ID" from df1. If they do it should return "Match" & "Correct" on two new columns named "Result_1" & "Result_2" respectively. For the ones that do not match, lookup whether they are present AT ALL in the "New_ID" column from df2. If they are return on those two columns mentioned above the values of "NEW Match" & "Good" respectively. If they are not there then return, "Not Match" & "Error".
For the first part of this task, this is the code that I used:
def compl(df1):
if (df1['ID'] == df1['Alternative ID']):
return 'Match', 'Correct'
elif (df1['ID'] != df1['ID']):
Here I cannot find the next step to basically check if the values that didn't match are in the df2, etc.
df1[['Result_1', 'Result_2']] = df1.apply(compl, axis = 1, result_type = 'expand')
Desirable output ->
ID Alternative ID Result_1 Result_2
0 152503 009372 NEW Match Good
1 249774 249774 Match Correct
2 062005 196582 NEW Match Good
3 185704 185704 Match Correct
4 081231 062085 Match Correct
5 912568 222416 Not Match Error
6 196782 195122 Not Match Error
Any suggestions/approaches would be greatly appreciated
CodePudding user response:
Use np.select
with your conditions and desired values. For the trueness of each condition, the function select
will map the value given.
import numpy as np
conditions = [
df1['ID'] == df1['Alternative ID'],
df1['ID'].isin(df2['New_ID'])
]
values_result1 = ['Match', 'New match']
values_result2 = ['Correct', 'Good']
df1['Result_1'] = np.select(conditions, values_result1, 'No match')
df1['Result_2'] = np.select(conditions, values_result2, 'Error')
Output
ID Alternative ID Result_1 Result_2
0 152503 9372 New match Good
1 249774 249774 Match Correct
2 62005 196582 New match Good
3 185704 185704 Match Correct
4 81231 81231 Match Correct
5 81231 62085 New match Good
6 912568 222416 No match Error
7 196782 195122 No match Error
Note: Your answer with some more tweaks will work. But it will be a lot slower than the above-vectorized approach. Try to refrain from using apply
until there's no other way.