I have a pandas dataframe with two series that each contain arrays of strings:
A B
0 ['1','2','3'] ['1','2','3']
1 ['1','2','3'] ['1','3','2']
2 ['1','2'] ['1','3']
3 ['1','2'] ['1']
4 ['1','2'] ['3']
5 ['1'] ['2']
I want to compare the elements in columns A
and B
in a specific way. I want row C
to be True
where there is at least 1 element common to both A
and B
.
In the example above, I would expect the following:
A B C
0 ['1','2','3'] ['1','2','3'] True
1 ['1','2','3'] ['1','3','2'] True
2 ['1','2'] ['1','3'] True
3 ['1','2'] ['1'] True
4 ['1','2'] ['3'] False
5 ['1'] ['2'] False
How can I accomplish this? Pandas does not even allow a straight equality check, as in the following:
df['C'] = df['A'] != df['B']
I tried experimenting with numpy.where()
and numpy.array_equal()
, but I can't figure out how to use a custom comparator for equality, nor does it seem to compare 2D arrays element-wise.
Any help is much appreciated! Thanks!
EDIT: Thanks @Tom and @Erfan for such quick and helpful replies! I will go with the Series.apply(set)
method for now for readability, unless performance becomes a serious concern.
CodePudding user response:
I would use numpy.intersect1d
with bool
:
df["C"] = [bool(len(np.intersect1d(x,y))) for x,y in zip(df["A"], df["B"])]
Or as suggested by @Erfan, you can use set.intersection
:
df["C"] = [bool(set(a).intersection(set(b))) for a, b in zip(df["A"], df["B"])]
# Output :
print(df)
A B C
0 ['1', '2', '3'] ['1', '2', '3'] True
1 ['1', '2', '3'] ['1', '3', '2'] True
2 ['1', '2'] ['1', '3'] True
3 ['1', '2'] ['1'] True
4 ['1', '2'] ['3'] False
5 ['1'] ['2'] False
CodePudding user response:
If your entries were sets instead of lists, you could check if their intersection is. This is apparently possible by using the &
operator column-wise (if someone could share documentation on this, that would be appreciated!):
>>> df['A'].apply(set) & df['B'].apply(set)
0 True
1 True
2 True
3 True
4 False
5 False
dtype: bool
You could also instead just create a for loop to do this set comparison, following this answer and the comment by Erfan:
>>> [bool(set(a) & set(b)) for a, b in zip(df['A'], df['B'])]
[True, True, True, True, False, False]
The latter approach seems to be the quickest on my machine for this example:
%%timeit
df['C'] = [bool(set(a) & set(b)) for a, b in zip(df['A'], df['B'])]
54.2 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df['C'] = [bool(len(np.intersect1d(x,y))) for x,y in zip(df["A"], df["B"])]
129 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df['C'] = df['A'].apply(set) & df['B'].apply(set)
233 µs ± 9.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
CodePudding user response:
A very easy way would be to use .isin like so:
df['C'] = df['A'].isin(df['B'])
This will create a new column, C, that contains True for rows where at least one element in column A exists in column B, and False otherwise. Checking if A is in B is sufficient, because you are looking for elements that exist in both columns. Therefore you don't have to check the other condition (B in A).
I can't speak of the speed of this operation compared to the other suggested solutions, though.