Lookup a value and if it is present in another df, return text in two new columns-CodePudding

I have gotten to a point with the following script and I wasn't able to take it all the way even with functions as the "is.in" possibly because I never used it prior to that.

Input df1:

    ID  Alternative ID
0   152503  009372
1   249774  249774
2   062005  196582
3   185704  185704
4   081231  081231
5   081231  062085
6   912568  222416
7   196782  195122

Input df2:

The idea is that I want to check whether the values of "ID" match with the ones on "Alternative ID" from df1. If they do it should return "Match" & "Correct" on two new columns named "Result_1" & "Result_2" respectively. For the ones that do not match, lookup whether they are present AT ALL in the "New_ID" column from df2. If they are return on those two columns mentioned above the values of "NEW Match" & "Good" respectively. If they are not there then return, "Not Match" & "Error".

For the first part of this task, this is the code that I used:

def compl(df1):

    if (df1['ID'] == df1['Alternative ID']):
        return 'Match', 'Correct'
    elif (df1['ID'] != df1['ID']):

Here I cannot find the next step to basically check if the values that didn't match are in the df2, etc.

df1[['Result_1', 'Result_2']] = df1.apply(compl, axis = 1, result_type = 'expand')

Desirable output ->

ID  Alternative ID  Result_1    Result_2
0   152503  009372  NEW Match   Good
1   249774  249774  Match       Correct
2   062005  196582  NEW Match   Good
3   185704  185704  Match       Correct
4   081231  062085  Match       Correct
5   912568  222416  Not Match   Error
6   196782  195122  Not Match   Error

Any suggestions/approaches would be greatly appreciated

CodePudding user response：

Use np.select with your conditions and desired values. For the trueness of each condition, the function select will map the value given.

import numpy as np

conditions = [
    df1['ID'] == df1['Alternative ID'],
    df1['ID'].isin(df2['New_ID'])
]
values_result1 = ['Match', 'New match']
values_result2 = ['Correct', 'Good']

df1['Result_1'] = np.select(conditions, values_result1, 'No match')
df1['Result_2'] = np.select(conditions, values_result2, 'Error')

Output

       ID  Alternative ID   Result_1 Result_2
0  152503            9372  New match     Good
1  249774          249774      Match  Correct
2   62005          196582  New match     Good
3  185704          185704      Match  Correct
4   81231           81231      Match  Correct
5   81231           62085  New match     Good
6  912568          222416   No match    Error
7  196782          195122   No match    Error

Note: Your answer with some more tweaks will work. But it will be a lot slower than the above-vectorized approach. Try to refrain from using apply until there's no other way.