Home > Software design >  Lookup a value and if it is present in another df, return text in two new columns
Lookup a value and if it is present in another df, return text in two new columns

Time:12-15

I have gotten to a point with the following script and I wasn't able to take it all the way even with functions as the "is.in" possibly because I never used it prior to that.

Input df1:

    ID  Alternative ID
0   152503  009372
1   249774  249774
2   062005  196582
3   185704  185704
4   081231  081231
5   081231  062085
6   912568  222416
7   196782  195122

Input df2:

    New_ID
0   498109
1   081231
2   231051
3   062005
4   152503
5   967272
6   875612

The idea is that I want to check whether the values of "ID" match with the ones on "Alternative ID" from df1. If they do it should return "Match" & "Correct" on two new columns named "Result_1" & "Result_2" respectively. For the ones that do not match, lookup whether they are present AT ALL in the "New_ID" column from df2. If they are return on those two columns mentioned above the values of "NEW Match" & "Good" respectively. If they are not there then return, "Not Match" & "Error".

For the first part of this task, this is the code that I used:

def compl(df1):

    if (df1['ID'] == df1['Alternative ID']):
        return 'Match', 'Correct'
    elif (df1['ID'] != df1['ID']):

Here I cannot find the next step to basically check if the values that didn't match are in the df2, etc.

df1[['Result_1', 'Result_2']] = df1.apply(compl, axis = 1, result_type = 'expand')

Desirable output ->

ID  Alternative ID  Result_1    Result_2
0   152503  009372  NEW Match   Good
1   249774  249774  Match       Correct
2   062005  196582  NEW Match   Good
3   185704  185704  Match       Correct
4   081231  062085  Match       Correct
5   912568  222416  Not Match   Error
6   196782  195122  Not Match   Error

Any suggestions/approaches would be greatly appreciated

CodePudding user response:

Use np.select with your conditions and desired values. For the trueness of each condition, the function select will map the value given.

import numpy as np

conditions = [
    df1['ID'] == df1['Alternative ID'],
    df1['ID'].isin(df2['New_ID'])
]
values_result1 = ['Match', 'New match']
values_result2 = ['Correct', 'Good']

df1['Result_1'] = np.select(conditions, values_result1, 'No match')
df1['Result_2'] = np.select(conditions, values_result2, 'Error')

Output

       ID  Alternative ID   Result_1 Result_2
0  152503            9372  New match     Good
1  249774          249774      Match  Correct
2   62005          196582  New match     Good
3  185704          185704      Match  Correct
4   81231           81231      Match  Correct
5   81231           62085  New match     Good
6  912568          222416   No match    Error
7  196782          195122   No match    Error

Note: Your answer with some more tweaks will work. But it will be a lot slower than the above-vectorized approach. Try to refrain from using apply until there's no other way.

  • Related