Comparing lists within dataframe column to another list using numpy's where function


This is my first post on Stack Overflow, so thank you for the help. I am trying to write code that compares a list stored in a dataframe column against another list and returns True if any dx (diagnosis) code appears in that other list. test is the dataframe with all the dx codes, and test['allDx'] is the column that holds the list of dx codes for each row.

import pandas as pd
import numpy as np

# build the example table
test = pd.DataFrame(np.array([['J123','K63','B45'],
                              ['E34','T65','G32'],
                              ['R12','T35','K12']]),
                    columns=['Dx1','Dx2','Dx3'])
# create a new column that turns Dx1, Dx2, Dx3 into a list
test['allDx'] = test[['Dx1','Dx2','Dx3']].values.tolist()

This code works:

    # method 1, using apply with a lambda function; this works
    test['Check'] = test.apply(lambda x: pd.Series(x['allDx']).isin(['J123','A','K12','D']).any(), axis=1)
    test
        Dx1  Dx2  Dx3             allDx  Check
    0  J123  K63  B45  [J123, K63, B45]   True
    1   E34  T65  G32   [E34, T65, G32]  False
    2   R12  T35  K12   [R12, T35, K12]   True

You can see that rows 0 and 2 return True because J123 exists in row 0 and K12 exists in row 2.

However, this method takes a long time to run over millions of rows, so I tried np.where instead, but it raises an error rather than returning a result.

# method 2, using np.where, does not work
test['Check'] = np.where(test['allDx'].isin(['J123','A','K12','D']).any(), 'True', 'False')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
...
TypeError: unhashable type: 'list'

Is there a way to perform the same comparison using np.where or another method that is faster than the lambda function?

Thanks!!

CodePudding user response:

You don't need the "allDx" column, just use np.isin:

test = pd.DataFrame(np.array([['J123','K63','B45'],
                               ['E34','T65','G32'],
                               ['R12','T35','K12']]), 
                    columns=['Dx1','Dx2','Dx3'])

test["Check"] = np.isin(test.to_numpy(),['J123','A','K12','D']).any(1)

>>> test
    Dx1  Dx2  Dx3  Check
0  J123  K63  B45   True
1   E34  T65  G32  False
2   R12  T35  K12   True
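
If the frame already carries the allDx list column built in the question, a small variation (a sketch, not part of the original answer; it assumes the question's test frame and imports) is to point np.isin at just the Dx columns so the list column is never touched:

# sketch: keep allDx, but run np.isin only on the scalar code columns
dx_cols = ['Dx1', 'Dx2', 'Dx3']
targets = ['J123', 'A', 'K12', 'D']
test['Check'] = np.isin(test[dx_cols].to_numpy(), targets).any(axis=1)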

CodePudding user response:

If you don't have the columns with the single values but only allDx, you can use a list comprehension with set operations, which should be quite a bit faster than apply:

target = ['J123','A','K12','D']
S = set(target)
test["Check"] = [bool(S.intersection(x)) for x in test['allDx']]

Alternative (potentially faster if the lists are large):

target = ['J123','A','K12','D']
S = set(target)
test["Check"] = [not S.isdisjoint(x) for x in test['allDx']]

Output:

    Dx1  Dx2  Dx3             allDx  Check
0  J123  K63  B45  [J123, K63, B45]   True
1   E34  T65  G32   [E34, T65, G32]  False
2   R12  T35  K12   [R12, T35, K12]   True
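
To back up the speed claim, here is a rough timing sketch; the synthetic frame, the 100,000-row size, and the names (rng, codes, big) are illustrative, so treat any numbers it prints as ballpark only:

import timeit

import numpy as np
import pandas as pd

# synthetic frame of 100,000 rows, three random codes per row
rng = np.random.default_rng(0)
codes = np.array(['J123', 'K63', 'B45', 'E34', 'T65', 'G32', 'R12', 'T35', 'K12'])
big = pd.DataFrame(rng.choice(codes, size=(100_000, 3)), columns=['Dx1', 'Dx2', 'Dx3'])
big['allDx'] = big[['Dx1', 'Dx2', 'Dx3']].values.tolist()

target = ['J123', 'A', 'K12', 'D']
S = set(target)

# original apply-based method 1
t_apply = timeit.timeit(
    lambda: big.apply(lambda x: pd.Series(x['allDx']).isin(target).any(), axis=1),
    number=1)

# set-based list comprehension from this answer
t_set = timeit.timeit(
    lambda: [not S.isdisjoint(x) for x in big['allDx']],
    number=1)

print(f"apply: {t_apply:.2f}s  set comprehension: {t_set:.2f}s")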

CodePudding user response:

This should work as well:

l = ['J123','A','K12','D']
test.assign(Check=test.isin(l).any(axis=1))

Output:

    Dx1  Dx2  Dx3  Check
0  J123  K63  B45   True
1   E34  T65  G32  False
2   R12  T35  K12   True
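
Note that assign returns a new DataFrame rather than modifying test in place, and this answer assumes the frame without the allDx list column (as in the first answer); with a column of lists, DataFrame.isin would most likely hit the same unhashable-list error. A minimal usage sketch:

# bind the result back to keep the Check column (sketch; l as defined above)
test = test.assign(Check=test.isin(l).any(axis=1))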