I have a dataframe with one column containing Tuple[int, int]
values. How do I run df.query on these values? It works as expected when I run this on primitive types like int, but doesn't seem to work on tuple values.
>>> df
        a   b
0  (1, 2)  11
1  (3, 4)  22
2  (5, 6)  33
>>> df.query('b==11')
        a   b
0  (1, 2)  11
>>> df.query('a==(1,2)')
Empty DataFrame
Columns: [a, b]
Index: []
Why is my second call not working? It should return the first row, right? What is the correct way to filter a pandas DataFrame in this case?
CodePudding user response:
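For reference, your example frame can be rebuilt like this (assuming column a holds plain Python tuples in an object-dtype column):
import pandas as pd

df = pd.DataFrame({'a': [(1, 2), (3, 4), (5, 6)], 'b': [11, 22, 33]})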
Your second call comes back empty because the (1, 2) written inside the query string is not evaluated to an actual tuple object to compare against each cell, so nothing matches. The fixes below all make sure a real tuple reaches the comparison.
The simplest option is to skip .query() and use .loc to filter the rows, as follows:
df.loc[df['a'] == (1, 2)]
Result:
        a   b
0  (1, 2)  11
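This works because, for an object-dtype column of tuples, pandas (in recent versions, at least) treats the tuple on the right-hand side as a single scalar value and compares every cell to it. If you ever hit a pandas version where == on a tuple column misbehaves, an explicit mask is a safe fallback; a sketch doing the same thing:
# Build the boolean mask cell by cell; equivalent to df['a'] == (1, 2)
df.loc[df['a'].apply(lambda t: t == (1, 2))]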
Or, if you want to keep using .query(), you can do it this way:
Define a variable for the tuple being tested and reference it with the @ prefix inside .query(), as follows:
cond = (1, 2)
df.query('a == @cond')
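This returns the expected row:
        a   b
0  (1, 2)  11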
or use tuple within .query(), as follows:
df.query('a == tuple((1, 2))')  # note the double parentheses: tuple() is called on the literal (1, 2)
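Both variants work for the same reason: they hand the comparison an actual Python tuple object rather than a literal parsed inside the query string. Note that tuple(1, 2), with a single pair of parentheses, would fail, since tuple() takes one iterable argument, not two numbers.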
Performance Comparison
In terms of performance, .loc is faster than .query(). Here are some benchmarking results:
%%timeit
df.loc[df['a'] == (1, 2)]
327 µs ± 40 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
cond = (1, 2)
df.query('a == @cond')
1.3 ms ± 51.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df.query('a == tuple((1, 2))')
1.49 ms ± 66.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
327 µs vs 1.3 ms or 1.49 ms: .loc is roughly 4 to 4.5 times faster.
For a larger DataFrame, 10,000 times the size of the sample data:
df2 = pd.concat([df] * 10000)
%%timeit
df2.loc[df2['a'] == (1, 2)]
1.59 ms ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
cond = (1, 2)
df2.query('a == @cond')
4.27 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df2.query('a == tuple((1, 2))')
4.32 ms ± 47.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.59 ms vs 4.27 ms: .loc is still around 2.7 times faster on the larger dataset.
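To reproduce the comparison outside a notebook, here is a minimal self-contained sketch; the exact timings above are from the original run, and yours will vary with machine and pandas version:
import timeit

import pandas as pd

df = pd.DataFrame({'a': [(1, 2), (3, 4), (5, 6)], 'b': [11, 22, 33]})
df2 = pd.concat([df] * 10000)
cond = (1, 2)  # must be visible to .query() via @cond, so keep it at module level

# Per-call cost in seconds = printed total / number
print(timeit.timeit(lambda: df2.loc[df2['a'] == (1, 2)], number=100))
print(timeit.timeit(lambda: df2.query('a == @cond'), number=100))
print(timeit.timeit(lambda: df2.query('a == tuple((1, 2))'), number=100))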