Why does the numpy .isin function give incorrect output?

I have a large DataFrame with millions of rows. I encoded all strings as numeric values so that I could use NumPy's vectorization to speed up processing.

I was looking for a way to quickly check whether a number exists in a list stored in another column. Previously I used a list comprehension on the string values, but after converting to np.arrays I was looking for a similar function.

I stumbled across this link: check if values of a column are in values of another numpy array column in pandas

To try out numpy.isin, I ran the code below:

dt = pd.DataFrame({'id' : ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [1,2,5,1,2],
                   'col_b': [2,2,[2,5,4],4,[1,5,6,3,2]]})
dt

    id  col_a   col_b
0   a   1       2
1   a   2       2
2   a   5       [2, 5, 4]
3   b   1       4
4   b   2       [1, 5, 6, 3, 2]

When I enter:

np.isin(dt['col_a'], dt['col_b'])

The output is:

array([False,  True, False, False,  True])

This is incorrect, as the 3rd row has 5 in both col_a and col_b.

Whereas if I change the value to 4, as below:

dt = pd.DataFrame({'id' : ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [1,2,4,1,2],
                   'col_b': [2,2,[2,5,4],4,[1,5,6,3,2]]})

dt

    id  col_a   col_b
0   a   1       2
1   a   2       2
2   a   4       [2, 5, 4]
3   b   1       4
4   b   2       [1, 5, 6, 3, 2]

and execute the same code:

np.isin(dt['col_a'], dt['col_b'])

I get the correct result:

array([False,  True,  True, False,  True])

Can someone please let me know why it gives different results?

CodePudding user response：

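np.isin checks each element of dt['col_a'] against the whole dt['col_b'] column, not against the value in the same row, i.e. roughly:

[
   1 in dt['col_b'].tolist(),
   2 in dt['col_b'].tolist(),
   5 in dt['col_b'].tolist(),
   ...
]

There's no top-level 5 in dt['col_b'] (the 5s sit inside nested lists, which are not searched), but there is a top-level 4 (row 3). That is why the second example only appears to work.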

From the docs:

isin is an element-wise function version of the python keyword in. isin(a, b) is roughly equivalent to np.array([item in b for item in a]) if a and b are 1-D sequences.
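
To make the quoted equivalence concrete, here is a small plain-Python sketch of the "item in b" check using the top-level values from the question. It only illustrates the rough equivalence from the docs; NumPy's actual handling of a mixed object column may differ in detail between versions.

```python
# Top-level values of dt['col_b'] from the question
col_b = [2, 2, [2, 5, 4], 4, [1, 5, 6, 3, 2]]

# `in` compares the needle with each top-level item only;
# it never looks inside the nested lists.
first = [item in col_b for item in [1, 2, 5, 1, 2]]   # col_a with 5 in row 2
second = [item in col_b for item in [1, 2, 4, 1, 2]]  # col_a with 4 in row 2

print(first)   # [False, True, False, False, True]
print(second)  # [False, True, True, False, True]
```

Both results match the arrays reported in the question: the 4 is "found" only because row 3 of col_b happens to hold a bare 4.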

The underlying issue is that dt['col_b'] is inconsistent: some values are numbers and some are lists. I think the easiest approach is a row-wise apply:

def isin(row):
    # col_b holds either a single int or a list of ints
    if isinstance(row['col_b'], int):
        return row['col_a'] == row['col_b']
    else:
        return row['col_a'] in row['col_b']

dt.apply(isin, axis=1)

Output:

0    False
1     True
2     True
3    False
4     True
dtype: bool
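
Since the question mentions millions of rows, a row-wise apply may be slow. One possible alternative, sketched here under the assumption that your pandas version has Series.explode (0.25 or later), is to flatten col_b and compare each element with the col_a value from its own row. I have not benchmarked this on your data, so it is worth timing both.

```python
import pandas as pd

dt = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [1, 2, 5, 1, 2],
                   'col_b': [2, 2, [2, 5, 4], 4, [1, 5, 6, 3, 2]]})

# explode() gives every list element its own row and leaves scalars
# untouched; the original index label is repeated for each element.
exploded = dt['col_b'].explode()

# Repeat col_a along the exploded index so each element is compared
# with the col_a value from its own original row.
col_a_repeated = dt['col_a'].reindex(exploded.index)
matches = exploded == col_a_repeated

# A row matches if any of its exploded elements matched.
result = matches.groupby(level=0).any()
print(result.tolist())  # [False, True, True, False, True]
```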

CodePudding user response：

Since col_b contains integers as well as lists, you may need to use apply and treat the two cases differently:

dt.apply(lambda x: x['col_a'] in x['col_b'] if isinstance(x['col_b'], list)
                   else x['col_a'] == x['col_b'], axis=1)

Output:

0    False
1     True
2     True
3    False
4     True
dtype: bool
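
If apply with axis=1 turns out to be too slow, the same per-row logic can be written as a list comprehension over the two columns, which avoids building a row object for every record. This is only a sketch; contains is a hypothetical helper name, and it assumes col_b holds nothing but ints and lists.

```python
import pandas as pd

dt = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [1, 2, 5, 1, 2],
                   'col_b': [2, 2, [2, 5, 4], 4, [1, 5, 6, 3, 2]]})

def contains(a, b):
    """Return True if a is an element of b (list) or equals b (scalar)."""
    return a in b if isinstance(b, list) else a == b

# Walk both columns in lockstep, one (col_a, col_b) pair per row.
result = pd.Series(
    [bool(contains(a, b)) for a, b in zip(dt['col_a'], dt['col_b'])],
    index=dt.index,
)
print(result.tolist())  # [False, True, True, False, True]
```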