I have a dataframe:
import pandas as pd

d = {'hour': [1, 1, 2, 1, 2], 'value': ['alpha', 'beta', 'alpha', 'beta', 'gamma']}
df = pd.DataFrame(data=d)
and a dictionary:
di = {1: 'alpha', 2: 'gamma'}
How can I return a True/False vector marking the rows where the dictionary's value for the key (hour) matches the dataframe's value column?
The result should be:
np.array([True, False, False, False, True])
CodePudding user response:
Are you looking for map?

df['hour'].map(di) == df['value']
output:
0 True
1 False
2 False
3 False
4 True
dtype: bool
It's trivial to turn that Series into a numpy array if you really want.
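For completeness, a minimal sketch of that conversion, recreating the question's data (the name `mask` is chosen here for illustration):

```python
import pandas as pd

d = {'hour': [1, 1, 2, 1, 2], 'value': ['alpha', 'beta', 'alpha', 'beta', 'gamma']}
df = pd.DataFrame(data=d)
di = {1: 'alpha', 2: 'gamma'}

# Map each hour to its expected value, compare element-wise,
# then convert the boolean Series to a numpy array
mask = (df['hour'].map(di) == df['value']).to_numpy()
print(mask)  # [ True False False False  True]
```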
CodePudding user response:
Solution
You could try one of the following methods. I tested them on Google Colab, and Method-2/3 do reasonably better (~300 µs) than Method-1 (1.71 ms).
items = di.items()
## Method-1: using pandas' .apply()
%time df.apply(lambda x: tuple(x) in items, axis=1).to_numpy()
# CPU times: user 1.7 ms, sys: 0 ns, total: 1.7 ms, Wall time: 1.71 ms
# array([True, False, False, False, True])
## Method-2: using a list-comprehension (for loop)
%time np.array([tuple(item) in items for item in df.values.tolist()])
# CPU times: user 278 µs, sys: 28 µs, total: 306 µs, Wall time: 316 µs
# array([ True, False, False, False, True])
## Method-3: using map(func, iterable)
%time np.array(list(map(lambda x: tuple(x) in items, df.values.tolist())))
# CPU times: user 287 µs, sys: 0 ns, total: 287 µs, Wall time: 293 µs
# array([ True, False, False, False, True])
Note
Note that the following (the solution proposed by Quang Hong) takes the longest compared to Methods 1-3 above. This could be because the data has very few rows; in a regime with millions of rows, it is possible that this method (being vectorized) performs better. But that remains something to be tested with real data.
%time df['hour'].map(di) == df['value']
# CPU times: user 2.37 ms, sys: 1.08 ms, total: 3.45 ms, Wall time: 3.45 ms
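To test that claim, one could cross-check the vectorized and row-wise approaches on a larger synthetic DataFrame. A minimal sketch (the size `n` and the random data are assumptions chosen here for illustration; wrap the two mask lines in `%time` on Colab to compare timings):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
big = pd.DataFrame({
    'hour': rng.integers(1, 3, size=n),
    'value': rng.choice(['alpha', 'beta', 'gamma'], size=n),
})
di = {1: 'alpha', 2: 'gamma'}
items = di.items()

# Vectorized: map each hour to its expected value, then compare
mask_map = (big['hour'].map(di) == big['value']).to_numpy()

# Row-wise: membership test of each (hour, value) pair in the dict items
mask_loop = np.array([tuple(row) in items for row in big.values.tolist()])

# Both approaches should agree on every row
assert (mask_map == mask_loop).all()
```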