I have two arrays that look like this:
import numpy as np
x1 = np.array([['a','b','c'],['d','a','b'],['c','a,c','c']])
x2 = np.array(['d','c','d'])
I want to check if each element in x2 exists in the corresponding column of x1. So I tried:
print((x1==x2).any(axis=0))
#array([ True, False, False])
Note that x2[1] in x1[2,1] evaluates to True. The problem is that sometimes the element we're looking for is inside an element of x1 (where it can be identified if we split by comma). So my desired output is:
array([ True, True, False])
Is there a way to do it using a numpy (or pandas) native method?
CodePudding user response:
You can vectorize a function to broadcast x2 in x1.split(','):
@np.vectorize
def f(a, b):
    # True if b is one of the comma-separated tokens in a
    return b in a.split(',')

f(x1, x2).any(axis=0)
# array([ True, True, False])
Note that "vectorize" is a misnomer. This isn't true vectorization, just a convenient way to broadcast a custom function.
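To illustrate why "vectorize" is a misnomer, here is a minimal sketch comparing the np.vectorize call with an explicit Python loop over the broadcast arrays. The names contains_token and looped are made up for this illustration. Both versions call the Python function once per element, so neither runs as compiled NumPy code.

```python
import numpy as np

x1 = np.array([['a', 'b', 'c'], ['d', 'a', 'b'], ['c', 'a,c', 'c']])
x2 = np.array(['d', 'c', 'd'])

def contains_token(cell, target):
    # True if target is one of the comma-separated tokens in cell
    return target in cell.split(',')

# np.vectorize broadcasts x2 against x1 and calls the function per element
vectorized = np.vectorize(contains_token)(x1, x2)

# The same work written out by hand: broadcast first, then loop in Python
b1, b2 = np.broadcast_arrays(x1, x2)
looped = np.array([[contains_token(a, b) for a, b in zip(row1, row2)]
                   for row1, row2 in zip(b1, b2)])

same = bool((vectorized == looped).all())
result = looped.any(axis=0)
```

Here result matches the desired output from the question, and same confirms that the two approaches agree on every element.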
Since you mentioned pandas in parentheses, another option is to apply a splitting/membership function to the columns of df = pd.DataFrame(x1).
However, the numpy function is significantly faster on this small example (the pandas helpers are defined after the timings):
f(x1, x2).any(axis=0) # 24.2 µs ± 2.8 µs
df.apply(list_comp).any() # 913 µs ± 12.1 µs
df.apply(combine_in).any() # 1.8 ms ± 104 µs
df.apply(expand_eq_any).any() # 3.28 ms ± 751 µs
# use a list comprehension to do the splitting and membership checking:
def list_comp(col):
    return [x2[col.name] in val.split(',') for val in col]

# split the whole column and use `combine` to check `x2 in x1`
def combine_in(col):
    return col.str.split(',').combine(x2[col.name], lambda a, b: b in a)

# split the column into expanded columns and check the expanded rows for matches
def expand_eq_any(col):
    return col.str.split(',', expand=True).eq(x2[col.name]).any(axis=1)
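For an end-to-end run of the pandas route, here is a minimal self-contained sketch. It assumes the default integer column labels of pd.DataFrame(x1), so that col.name can index x2 directly. The helper column_contains is a hypothetical variant written for this example, using Series.map instead of the helpers above.

```python
import numpy as np
import pandas as pd

x1 = np.array([['a', 'b', 'c'], ['d', 'a', 'b'], ['c', 'a,c', 'c']])
x2 = np.array(['d', 'c', 'd'])

# Default column labels 0, 1, 2 line up with the positions in x2
df = pd.DataFrame(x1)

def column_contains(col):
    # col.name is the column label, used here as an index into x2
    target = x2[col.name]
    return col.map(lambda val: target in val.split(','))

# One boolean per column: does any cell contain the target token?
pandas_result = df.apply(column_contains).any().to_numpy()
```

If the DataFrame had non-default column labels, the lookup x2[col.name] would need to be replaced by a positional lookup.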