Efficient selection of values in numpy

I'm trying to find elements of one DataFrame (df_other) which match a column in another DataFrame (df). In other words, I'd like to know where the values in df['a'] match the values in df_other['a'] for each row in df['a'].

An example may explain the expected result more clearly:

>>> import pandas as pd
>>> import numpy as np
>>> 
>>> 
>>> df = pd.DataFrame({'a': ['x', 'y', 'z']})
>>> df
   a
0  x
1  y
2  z
>>> df_other = pd.DataFrame({'a': ['x', 'x', 'y', 'z', 'z2'], 'c': [1, 2, 3, 4, 5]})
>>> df_other
    a  c
0   x  1
1   x  2
2   y  3
3   z  4
4  z2  5
>>> 
>>> 
>>> u = df_other['c'].unique()
>>> u
array([1, 2, 3, 4, 5])
>>> bm = np.ones((len(df), len(u)), dtype=bool)
>>> bm
array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]])

The filtering should then yield a bitmap of

[
 [1, 1, 0, 0, 0], # [1, 2] are df_other['c'] where df_other['a'] == df['a']
 [0, 0, 1, 0, 0], # [3] matches
 [0, 0, 0, 1, 0], # [4] matches
]

I'm looking for a fast numpy implementation that doesn't iterate through all rows (which is my current solution):

>>> df_other['a'] == df.loc[0, 'a']
0     True
1     True
2    False
3    False
4    False
Name: a, dtype: bool
>>> 
>>> 
>>> df_other['a'] == df.loc[1, 'a']
0    False
1    False
2     True
3    False
4    False
Name: a, dtype: bool
>>> df_other['a'] == df.loc[2, 'a']
0    False
1    False
2    False
3     True
4    False
Name: a, dtype: bool

Note: in the actual production code, there are many more column conditions ((df['a'] == df_other['a']) & (df['b'] == df_other['b']) & ...), but there are generally fewer conditions than rows in df, so I wouldn't mind a solution that loops over the conditions (and subsequently sets values in bm to False).

Also, the bitmap should have the shape (len(df), len(df_other['c'].unique())).

CodePudding user response:

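NumPy broadcasting does this in a single comparison. Here df.values has shape (3, 1), because df has only one column, and df_other.values[:, 0] has shape (5,), so comparing them gives a (3, 5) boolean array: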

bm = df_other.values[:, 0] == df.values

Output:

>>> bm
array([[ True,  True, False, False, False],
       [False, False,  True, False, False],
       [False, False, False,  True, False]])
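
For the multi-column case mentioned in the question, one possible approach is to loop over the condition columns and AND the broadcast comparisons together. The following is only a sketch: the column 'b', its values, and the helper name match_mask are made up for illustration and are not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column 'b' is added here only to show a second condition.
df = pd.DataFrame({'a': ['x', 'y', 'z'], 'b': [1, 2, 3]})
df_other = pd.DataFrame({
    'a': ['x', 'x', 'y', 'z', 'z2'],
    'b': [1, 9, 2, 3, 3],
    'c': [1, 2, 3, 4, 5],
})

def match_mask(left, right, columns):
    """Return a (len(left), len(right)) boolean array that is True where
    every listed column agrees between a row of left and a row of right."""
    mask = np.ones((len(left), len(right)), dtype=bool)
    for col in columns:
        # (n_left, 1) compared with (n_right,) broadcasts to (n_left, n_right)
        mask &= left[col].to_numpy()[:, None] == right[col].to_numpy()
    return mask

bm = match_mask(df, df_other, ['a', 'b'])
```

The loop runs once per condition rather than once per row, which is the trade-off the question said was acceptable. Note that each comparison allocates a full len(df) x len(df_other) array, so memory grows with the product of the two lengths.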

If you need it as ints:

>>> bm.astype(int)
array([[1, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0]])
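
One caveat: the columns of bm above correspond to the rows of df_other, which matches the requested shape only because every value in df_other['c'] is unique in the example. If 'c' can repeat, a possible way to get one column per unique value is to collapse the row-level mask with a one-hot matrix. This is a sketch under that assumption, with the 'c' values changed for illustration; also note that np.unique returns sorted values, whereas Series.unique keeps order of first appearance.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y', 'z']})
# Hypothetical data: 'c' repeats the value 1, unlike the original example.
df_other = pd.DataFrame({'a': ['x', 'x', 'y', 'z', 'z2'],
                         'c': [1, 2, 1, 4, 5]})

# Row-level match, shape (len(df), len(df_other))
rows = df['a'].to_numpy()[:, None] == df_other['a'].to_numpy()

# u holds the sorted unique values of 'c'; inverse[j] is the position in u
# of the 'c' value in row j of df_other.
u, inverse = np.unique(df_other['c'].to_numpy(), return_inverse=True)

# One-hot encoding of 'c', shape (len(df_other), len(u))
onehot = inverse.reshape(-1, 1) == np.arange(len(u))

# bm[i, k] is True if any row of df_other matching df row i has c == u[k]
bm = (rows.astype(int) @ onehot.astype(int)) > 0
```

Here bm has shape (len(df), len(u)), and column k refers to the value u[k] rather than to a row position in df_other.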