My requirement is I have a large dataframe with millions of rows. I encoded all strings to numeric values in order to use numpys
vectorization to increase processing speed.
So I was looking at a way to quickly check if a number exists in another list column. Previously, I was using list comprehension with string values, but with after converting to np.arrays
was looking at similar function.
I stumbled across this link: check if values of a column are in values of another numpy array column in pandas
In order to the numpy.isin
, I tried running below code:
dt = pd.DataFrame({'id' : ['a', 'a', 'a', 'b', 'b'],
'col_a': [1,2,5,1,2],
'col_b': [2,2,[2,5,4],4,[1,5,6,3,2]]})
dt
id col_a col_b
0 a 1 2
1 a 2 2
2 a 5 [2, 5, 4]
3 b 1 4
4 b 2 [1, 5, 6, 3, 2]
When I enter:
np.isin(dt['col_a'], dt['col_b'])
The output is:
array([False, True, False, False, True])
Which is incorrect as the 3rd row has 5 in both columns col_a
and col_b
.
Where as if I change the value to 4 as below:
dt = pd.DataFrame({'id' : ['a', 'a', 'a', 'b', 'b'],
'col_a': [1,2,4,1,2],
'col_b': [2,2,[2,5,4],4,[1,5,6,3,2]]})
dt
id col_a col_b
0 a 1 2
1 a 2 2
2 a 4 [2, 5, 4]
3 b 1 4
4 b 2 [1, 5, 6, 3, 2]
and execute same code:
np.isin(dt['col_a'], dt['col_b'])
I get correct result:
array([False, True, True, False, True])
Can someone please let me know why it's giving different results.
CodePudding user response:
np.isin
for each element from dt['col_a']
checks whether it is present in the whole dt['col_b']
column, i.e.:
[
1 in dt['col_b'],
2 in dt['col_b'],
5 in dt['col_b'],
...
]
There's no 5
in dt['col_b']
but there's 4
From the docs
isin
is an element-wise function version of the python keyword in.isin(a, b)
is roughly equivalent tonp.array([item in b for item in a])
ifa
andb
are 1-D sequences.
Also, your issue is that you have an inconsistent dt['col_b']
column (some values are numbers some are lists). I think the easiest approach is to use apply:
def isin(row):
if isinstance(row['col_b'], int):
return row['col_a'] == row['col_b']
else:
return row['col_a'] in row['col_b']
dt.apply(isin, axis=1)
Output:
0 False
1 True
2 True
3 False
4 True
dtype: bool
CodePudding user response:
Since col_b
not only has lists but also integers, you may need to use apply
and treat them differently:
( dt.apply(lambda x: x['col_a'] in x['col_b'] if type(x['col_b']) is list
else x['col_a'] == x['col_b'], axis=1)
Output:
0 False
1 True
2 True
3 False
4 True
dtype: bool