I'm trying to find elements of one DataFrame (df_other
) which match a column in another DataFrame (df
). In other words, I'd like to know where the values in df['a']
match the values in df_other['a']
for each row in df['a']
.
An example might be easier to explain the expected result:
>>> import pandas as pd
>>> import numpy as np
>>>
>>>
>>> df = pd.DataFrame({'a': ['x', 'y', 'z']})
>>> df
a
0 x
1 y
2 z
>>> df_other = pd.DataFrame({'a': ['x', 'x', 'y', 'z', 'z2'], 'c': [1, 2, 3, 4, 5]})
>>> df_other
a c
0 x 1
1 x 2
2 y 3
3 z 4
4 z2 5
>>>
>>>
>>> u = df_other['c'].unique()
>>> u
array([1, 2, 3, 4, 5])
>>> bm = np.ones((len(df), len(u)), dtype=bool)
>>> bm
array([[ True, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True]])
should yield a bitmap of
[
[1, 1, 0, 0, 0], # [1, 2] are df_other['c'] where df_other['a'] == df['a']
[0, 0, 1, 0, 0], # [3] matches
[0, 0, 0, 1, 0], # [4] matches
]
I'm looking for a fast numpy implementation that doesn't iterate through all rows (which is my current solution):
>>> df_other['a'] == df.loc[0, 'a']
0 True
1 True
2 False
3 False
4 False
Name: a, dtype: bool
>>>
>>>
>>> df_other['a'] == df.loc[1, 'a']
0 False
1 False
2 True
3 False
4 False
Name: a, dtype: bool
>>> df_other['a'] == df.loc[2, 'a']
0 False
1 False
2 False
3 True
4 False
Name: a, dtype: bool
Note: in the actual production code, there are many more column conditions ((df['a'] == df_other['a']) & (df['b'] == df_other['b'] & ...
), but they are generally less than the number of rows in df
, so I wouldn't mind a solution that loops over the conditions (and subsequently sets values in bm
to false).
Also, the bitmap should have the shape of (len(df), len(df_other['c'].unique))
.
CodePudding user response:
numpy broadcasting is so useful here:
bm = df_other.values[:, 0] == df.values
Output:
>>> bm
array([[ True, True, False, False, False],
[False, False, True, False, False],
[False, False, False, True, False]])
If you need it as ints:
>>> bm.astype(int)
array([[1, 1, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 1, 0]])