I have a dataframe with one column containing Tuple[int, int]
values. How do I run df.query on these values? It works as expected when I run this on primitive types like int, but doesn't seem to work on tuple values.
>>> df
        a   b
0  (1, 2)  11
1  (3, 4)  22
2  (5, 6)  33
>>> df.query('b==11')
        a   b
0  (1, 2)  11
>>> df.query('a==(1,2)')
Empty DataFrame
Columns: [a, b]
Index: []
Why is my second call not working? It should return the first row, right? What is the correct way to filter a pandas DataFrame in this case?
CodePudding user response:
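For reference, your example frame can be rebuilt like this (assuming column a holds plain Python tuples in an object-dtype column):
import pandas as pd

df = pd.DataFrame({'a': [(1, 2), (3, 4), (5, 6)], 'b': [11, 22, 33]})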
Your second call comes back empty because the (1, 2) written inside the query string is not evaluated to an actual tuple object to compare against each cell, so nothing matches. The fixes below all make sure a real tuple reaches the comparison.
The simplest option is to skip .query() and use .loc to filter the rows, as follows:
df.loc[df['a'] == (1, 2)]
Result:
        a   b
0  (1, 2)  11
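This works because, for an object-dtype column of tuples, pandas (in recent versions, at least) treats the tuple on the right-hand side as a single scalar value and compares every cell to it. If you ever hit a pandas version where == on a tuple column misbehaves, an explicit mask is a safe fallback; a sketch doing the same thing:
# Build the boolean mask cell by cell; equivalent to df['a'] == (1, 2)
df.loc[df['a'].apply(lambda t: t == (1, 2))]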
Or, if you want to keep using .query(), you can do it this way:
Define a variable for the tuple being tested and reference it with the @ prefix inside .query(), as follows:
cond = (1, 2)
df.query('a == @cond')
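This returns the expected row:
        a   b
0  (1, 2)  11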
or use tuple within .query(), as follows:
df.query('a == tuple((1, 2))')  # note the double parentheses: tuple() is called on the literal (1, 2)
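Both variants work for the same reason: they hand the comparison an actual Python tuple object rather than a literal parsed inside the query string. Note that tuple(1, 2), with a single pair of parentheses, would fail, since tuple() takes one iterable argument, not two numbers.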
Performance Comparison
In terms of performance, .loc is faster than .query(). Here are some benchmarking results:
%%timeit
df.loc[df['a'] == (1, 2)]
327 µs ± 40 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
cond = (1, 2)
df.query('a == @cond')
1.3 ms ± 51.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df.query('a == tuple((1, 2))')
1.49 ms ± 66.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
327 µs vs 1.3 ms or 1.49 ms: .loc is roughly 4 to 4.5 times faster.
For a larger DataFrame, 10,000 times the size of the sample data:
df2 = pd.concat([df] * 10000)
%%timeit
df2.loc[df2['a'] == (1, 2)]
1.59 ms ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
cond = (1, 2)
df2.query('a == @cond')
4.27 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df2.query('a == tuple((1, 2))')
4.32 ms ± 47.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.59 ms vs 4.27 ms: .loc is still around 2.7 times faster on the larger dataset.
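To reproduce the comparison outside a notebook, here is a minimal self-contained sketch; the exact timings above are from the original run, and yours will vary with machine and pandas version:
import timeit

import pandas as pd

df = pd.DataFrame({'a': [(1, 2), (3, 4), (5, 6)], 'b': [11, 22, 33]})
df2 = pd.concat([df] * 10000)
cond = (1, 2)  # must be visible to .query() via @cond, so keep it at module level

# Per-call cost in seconds = printed total / number
print(timeit.timeit(lambda: df2.loc[df2['a'] == (1, 2)], number=100))
print(timeit.timeit(lambda: df2.query('a == @cond'), number=100))
print(timeit.timeit(lambda: df2.query('a == tuple((1, 2))'), number=100))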