This is my first post at Stackoverflow, so thank you for the help. I am trying to replicate a code where I can match a list within a dataframe to another list, and if the dx (diagnosis) exists in the other list, then it returns a true. Test is the dataframe with all the dx codes, and test['allDX'] is the column that has the list of all dxcodes.
import pandas as pd
import numpy as np
#initiate a table
test=pd.DataFrame(np.array([['J123','K63','B45'],['E34','T65','G32'],['R12','T35','K12']]), columns=['Dx1','Dx2','Dx3'])
#create a new column that turns dx1,dx2,dx3 into a list
test['allDx']=test[['Dx1','Dx2','Dx3']].values.tolist()
This code works:
#method 1 using apply, lambda function, works
test['Check']=test.apply(lambda x: pd.Series(x['allDx']).isin(['J123','A','K12','D']).any() , axis=1)
test
Dx1 Dx2 Dx3 allDx Check
0 J123 K63 B45 [J123, K63, B45] True
1 E34 T65 G32 [E34, T65, G32] False
2 R12 T35 K12 [R12, T35, K12] True
You can see that row 0, 1 return true because J123 exists in row 0, and K12 exists in row 2.
However, this method takes a long time to run when doing over millions of rows, so I thought about using np.where, but it's not returning the results.
#method 2, using np.where, does not work.
test['Check']=np.where(test['allDx'].isin(['J123','A','K12','D']).any(), 'True','False')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
TypeError: unhashable type: 'list'
The above exception was the direct cause of the following exception:
Is there a way to perform the same comparison using np.where or another method that is faster than the lambda function?
Thanks!!
CodePudding user response:
You don't need the "allDx" column, just use np.isin
:
test = pd.DataFrame(np.array([['J123','K63','B45'],
['E34','T65','G32'],
['R12','T35','K12']]),
columns=['Dx1','Dx2','Dx3'])
test["Check"] = np.isin(test.to_numpy(),['J123','A','K12','D']).any(1)
>>> test
Dx1 Dx2 Dx3 Check
0 J123 K63 B45 True
1 E34 T65 G32 False
2 R12 T35 K12 True
CodePudding user response:
If you don't have the columns with the single values but only allDx
you can use a list comprehension and set
operations that should be quite faster than apply
:
target = ['J123','A','K12','D']
S = set(target)
test["Check"] = [bool(S.intersection(x)) for x in test['allDx']]
Alternative (potentially faster if the lists are large):
target = ['J123','A','K12','D']
S = set(target)
test["Check"] = [not S.isdisjoint(x) for x in test['allDx']]
Output:
Dx1 Dx2 Dx3 allDx Check
0 J123 K63 B45 [J123, K63, B45] True
1 E34 T65 G32 [E34, T65, G32] False
2 R12 T35 K12 [R12, T35, K12] True
CodePudding user response:
This should work as well:
test.assign(Check = test.isin(l).any(axis=1))
Output:
Dx1 Dx2 Dx3 Check
0 J123K63 B45 True
1 E34 T65 G32 False
2 R12 T35 K12 True