Home > Enterprise >  Pandas: compare two series of arrays with custom comparator
Pandas: compare two series of arrays with custom comparator

Time:12-14

I have a pandas dataframe with two series that each contain arrays of strings:

   A               B            
0  ['1','2','3']   ['1','2','3']
1  ['1','2','3']   ['1','3','2']
2  ['1','2']       ['1','3']    
3  ['1','2']       ['1']        
4  ['1','2']       ['3']        
5  ['1']           ['2']        

I want to compare the elements in columns A and B in a specific way. I want row C to be True where there is at least 1 element common to both A and B.

In the example above, I would expect the following:

   A               B                  C
0  ['1','2','3']   ['1','2','3']      True
1  ['1','2','3']   ['1','3','2']      True
2  ['1','2']       ['1','3']          True
3  ['1','2']       ['1']              True
4  ['1','2']       ['3']              False
5  ['1']           ['2']              False

How can I accomplish this? Pandas does not even allow a straight equality check, as in the following:

df['C'] = df['A'] != df['B']

I tried experimenting with numpy.where() and numpy.array_equal(), but I can't figure out how to use a custom comparator for equality, nor does it seem to compare 2D arrays element-wise.

Any help is much appreciated! Thanks!

EDIT: Thanks @Tom and @Erfan for such quick and helpful replies! I will go with the Series.apply(set) method for now for readability, unless performance becomes a serious concern.

CodePudding user response:

I would use numpy.intersect1d with bool :

df["C"] = [bool(len(np.intersect1d(x,y))) for x,y in zip(df["A"], df["B"])]

Or as suggested by @Erfan, you can use set.intersection :

df["C"] = [bool(set(a).intersection(set(b))) for a, b in zip(df["A"], df["B"])]

# Output :

print(df)

                 A                B      C
0  ['1', '2', '3']  ['1', '2', '3']   True
1  ['1', '2', '3']  ['1', '3', '2']   True
2       ['1', '2']       ['1', '3']   True
3       ['1', '2']            ['1']   True
4       ['1', '2']            ['3']  False
5            ['1']            ['2']  False

CodePudding user response:

If your entries were sets instead of lists, you could check if their intersection is. This is apparently possible by using the & operator column-wise (if someone could share documentation on this, that would be appreciated!):

>>> df['A'].apply(set) & df['B'].apply(set)
0     True
1     True
2     True
3     True
4    False
5    False
dtype: bool

You could also instead just create a for loop to do this set comparison, following this answer and the comment by Erfan:

>>> [bool(set(a) & set(b)) for a, b in zip(df['A'], df['B'])]
[True, True, True, True, False, False]

The latter approach seems to be the quickest on my machine for this example:

%%timeit
df['C'] = [bool(set(a) & set(b)) for a, b in zip(df['A'], df['B'])]
54.2 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
df['C'] = [bool(len(np.intersect1d(x,y))) for x,y in zip(df["A"], df["B"])]
129 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
df['C'] = df['A'].apply(set) & df['B'].apply(set)
233 µs ± 9.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

CodePudding user response:

A very easy way would be to use .isin like so:

df['C'] = df['A'].isin(df['B'])

This will create a new column, C, that contains True for rows where at least one element in column A exists in column B, and False otherwise. Checking if A is in B is sufficient, because you are looking for elements that exist in both columns. Therefore you don't have to check the other condition (B in A).

I can't speak of the speed of this operation compared to the other suggested solutions, though.

  • Related