Home > front end >  How to check if two columns match some value
How to check if two columns match some value

Time:12-04

I have a dataframe:

d = {'hour': [1, 1, 2, 1, 2], 'value': ['alpha', 'beta', 'alpha', 'beta', 'gamma']}

df = pd.DataFrame(data=d)

and a dictionary:

di = {1: 'alpha', 2: 'gamma'}

How can I return a vector of True/False of rows where dictionary key (hour) matches its value in a dataframe's column value.

The result should be:

np.array([True, False, False, False, True])

CodePudding user response:

Are you looking for map:

df['hour'].map(di) == df['value']

output:

0     True
1    False
2    False
3    False
4     True
dtype: bool

It's trivial to turn that series into numpy array if you really want.

CodePudding user response:

Solution

You could try one of the following methods. I tested them on Google Colab. And it appears Method-2/3 do reasonably better (~300 µs) than Method-1 (1.71 ms).

items = di.items()

## Method-1: using pandas' .apply()
%time df.apply(lambda x: tuple(x) in items, axis=1).to_numpy()
# CPU times: user 1.7 ms, sys: 0 ns, total: 1.7 ms, Wall time: 1.71 ms
# array([True, False, False, False, True])

## Method-2: using a list-comprehension (for loop)
%time np.array([tuple(item) in items for item in df.values.tolist()])
# CPU times: user 278 µs, sys: 28 µs, total: 306 µs, Wall time: 316 µs
# array([ True, False, False, False,  True])

## Method-3: using map(func, iterable)
%time np.array(list(map(lambda x: tuple(x) in items, df.values.tolist())))
# CPU times: user 287 µs, sys: 0 ns, total: 287 µs, Wall time: 293 µs
# array([ True, False, False, False,  True])

Note

If you see the time it takes for the following (the solution proposed by Quang Hong), takes longest compared to methods-1,2,3 above. Now, this could be because the data has very few rows, and may be in a regime where you have millions of rows, it is possible to find the following method (being vectorized) performing better. But this remains something to be tested with hard data.

%time df['hour'].map(di) == df['value']
# CPU times: user 2.37 ms, sys: 1.08 ms, total: 3.45 ms, Wall time: 3.45 ms
  • Related