Comparing lists within dataframe column to another list using numpy's where function


This is my first post on Stack Overflow, so thank you for the help. I am trying to write code that compares a list stored in a dataframe column against another list and returns True if any dx (diagnosis) code appears in that other list. test is the dataframe with all the dx codes, and test['allDx'] is the column that holds the list of dx codes for each row.

import pandas as pd
import numpy as np

# build the example table
test = pd.DataFrame(np.array([['J123','K63','B45'],
                              ['E34','T65','G32'],
                              ['R12','T35','K12']]),
                    columns=['Dx1','Dx2','Dx3'])
# create a new column that turns Dx1, Dx2, Dx3 into a list
test['allDx'] = test[['Dx1','Dx2','Dx3']].values.tolist()

This code works:

    # method 1, using apply with a lambda function; this works
    test['Check'] = test.apply(lambda x: pd.Series(x['allDx']).isin(['J123','A','K12','D']).any(), axis=1)
    test
        Dx1  Dx2  Dx3             allDx  Check
    0  J123  K63  B45  [J123, K63, B45]   True
    1   E34  T65  G32   [E34, T65, G32]  False
    2   R12  T35  K12   [R12, T35, K12]   True

You can see that rows 0 and 2 return True because J123 exists in row 0 and K12 exists in row 2.

However, this method takes a long time to run over millions of rows, so I tried np.where instead, but it raises an error rather than returning a result.

# method 2, using np.where, does not work
test['Check'] = np.where(test['allDx'].isin(['J123','A','K12','D']).any(), 'True', 'False')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
...
TypeError: unhashable type: 'list'

Is there a way to perform the same comparison using np.where or another method that is faster than the lambda function?

Thanks!!

CodePudding user response:

You don't need the "allDx" column, just use np.isin:

test = pd.DataFrame(np.array([['J123','K63','B45'],
                               ['E34','T65','G32'],
                               ['R12','T35','K12']]), 
                    columns=['Dx1','Dx2','Dx3'])

test["Check"] = np.isin(test.to_numpy(),['J123','A','K12','D']).any(1)

>>> test
    Dx1  Dx2  Dx3  Check
0  J123  K63  B45   True
1   E34  T65  G32  False
2   R12  T35  K12   True
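
If the frame already carries the allDx list column built in the question, a small variation (a sketch, not part of the original answer; it assumes the question's test frame and imports) is to point np.isin at just the Dx columns so the list column is never touched:

# sketch: keep allDx, but run np.isin only on the scalar code columns
dx_cols = ['Dx1', 'Dx2', 'Dx3']
targets = ['J123', 'A', 'K12', 'D']
test['Check'] = np.isin(test[dx_cols].to_numpy(), targets).any(axis=1)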

CodePudding user response:

If you don't have the columns with the single values but only allDx, you can use a list comprehension with set operations, which should be quite a bit faster than apply:

target = ['J123','A','K12','D']
S = set(target)
test["Check"] = [bool(S.intersection(x)) for x in test['allDx']]

Alternative (potentially faster if the lists are large):

target = ['J123','A','K12','D']
S = set(target)
test["Check"] = [not S.isdisjoint(x) for x in test['allDx']]

Output:

    Dx1  Dx2  Dx3             allDx  Check
0  J123  K63  B45  [J123, K63, B45]   True
1   E34  T65  G32   [E34, T65, G32]  False
2   R12  T35  K12   [R12, T35, K12]   True
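
To back up the speed claim, here is a rough timing sketch; the synthetic frame, the 100,000-row size, and the names (rng, codes, big) are illustrative, so treat any numbers it prints as ballpark only:

import timeit

import numpy as np
import pandas as pd

# synthetic frame of 100,000 rows, three random codes per row
rng = np.random.default_rng(0)
codes = np.array(['J123', 'K63', 'B45', 'E34', 'T65', 'G32', 'R12', 'T35', 'K12'])
big = pd.DataFrame(rng.choice(codes, size=(100_000, 3)), columns=['Dx1', 'Dx2', 'Dx3'])
big['allDx'] = big[['Dx1', 'Dx2', 'Dx3']].values.tolist()

target = ['J123', 'A', 'K12', 'D']
S = set(target)

# original apply-based method 1
t_apply = timeit.timeit(
    lambda: big.apply(lambda x: pd.Series(x['allDx']).isin(target).any(), axis=1),
    number=1)

# set-based list comprehension from this answer
t_set = timeit.timeit(
    lambda: [not S.isdisjoint(x) for x in big['allDx']],
    number=1)

print(f"apply: {t_apply:.2f}s  set comprehension: {t_set:.2f}s")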

CodePudding user response:

This should work as well:

l = ['J123','A','K12','D']
test.assign(Check=test.isin(l).any(axis=1))

Output:

    Dx1  Dx2  Dx3  Check
0  J123  K63  B45   True
1   E34  T65  G32  False
2   R12  T35  K12   True
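
Note that assign returns a new DataFrame rather than modifying test in place, and this answer assumes the frame without the allDx list column (as in the first answer); with a column of lists, DataFrame.isin would most likely hit the same unhashable-list error. A minimal usage sketch:

# bind the result back to keep the Check column (sketch; l as defined above)
test = test.assign(Check=test.isin(l).any(axis=1))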