Pandas mask with composite expression behaviour

Time:04-29

This question was previously asked (and then deleted) by another user. I was working on a solution so I could post an answer when the question disappeared. On top of that, I can't make sense of pandas' behaviour, so I would appreciate some clarity. The original question asked something along the lines of:

How can I replace every negative value except those in a given list with NaN in a Pandas dataframe?

My setup to reproduce the scenario is the following:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': list(range(4)),      # 0, 1, 2, 3
    'B': list(range(-2, 2))   # -2, -1, 0, 1
})

This should technically only be a matter of passing the correct boolean mask to the DataFrame (equivalent to calling DataFrame.where). My attempted solution looks like:

df[df >= 0 | df.isin([-2])] 

which produces:

index A B
0 0 NaN
1 1 NaN
2 2 0
3 3 1

which also replaces -2, the number in the list, with NaN!

Moreover, if I mask the dataframe with each of the two conditions separately, I get the correct behaviour:

with df[df >= 0] (identical to the compound result)

index A B
0 0 NaN
1 1 NaN
2 2 0
3 3 1

with df[df.isin([-2])] (only the listed value is kept)

index A B
0 NaN -2.0
1 NaN NaN
2 NaN NaN
3 NaN NaN

So it seems that either:

  1. I am running into some undefined behaviour as a result of performing logic on NaN values, or
  2. I have got something wrong.

Can anyone clarify this situation for me?

CodePudding user response:

Solution

df[(df >= 0) | (df.isin([-2]))] 
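
As a minimal sketch, this is how the parenthesized mask could be applied to the setup from the question. The `keep` list and the `mask` and `result` names are illustrative and not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({
    'A': list(range(4)),
    'B': list(range(-2, 2)),
})

keep = [-2]  # hypothetical list of negative values that should survive

# Each condition is wrapped in parentheses before combining with |
mask = (df >= 0) | df.isin(keep)

# Indexing with a boolean DataFrame keeps values where the mask is True
# and puts NaN everywhere else, like df.where(mask)
result = df[mask]
print(result)
```

With this mask, -2 in column B is kept while -1 becomes NaN.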

Explanation

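In Python, the bitwise OR operator | has higher precedence than comparison operators like >= (see https://docs.python.org/3/reference/expressions.html#operator-precedence). The original expression df >= 0 | df.isin([-2]) is therefore evaluated as df >= (0 | df.isin([-2])), not as (df >= 0) | df.isin([-2]).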
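
The precedence rule can be checked directly, first with plain integers and then with the DataFrame from the question. This is only a sketch; the variable names `unparenthesized` and `explicit` are made up for illustration, and it assumes a pandas version that accepts `0 | <boolean DataFrame>`, as the one in the question evidently did.

```python
import pandas as pd

# Plain integers: | is applied before >=
print(1 >= 0 | 2)      # parsed as 1 >= (0 | 2), i.e. 1 >= 2 -> False
print((1 >= 0) | 2)    # True | 2 -> 3

df = pd.DataFrame({
    'A': list(range(4)),
    'B': list(range(-2, 2)),
})

# The expression from the question, exactly as written
unparenthesized = df >= 0 | df.isin([-2])

# The grouping Python actually applies to it
explicit = df >= (0 | df.isin([-2]))

# Both masks are the same, which is why -2 was not preserved
print(unparenthesized.equals(explicit))
```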

When filtering a pandas DataFrame on multiple boolean conditions, you need to enclose each condition in parentheses. More from the boolean indexing section of the pandas user guide:

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).
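
The example from the user guide can be tried on a small frame. The data below is invented purely for illustration. Without parentheses the expression turns into a chained comparison, which needs the truth value of a whole Series and so raises a ValueError; with parentheses it filters as intended.

```python
import pandas as pd

# Hypothetical data, chosen only to make the filter easy to follow
df = pd.DataFrame({'A': [1, 3, 5], 'B': [1, 2, 4]})

raised = False
try:
    # Parsed as df['A'] > (2 & df['B']) < 3, a chained comparison
    df[df['A'] > 2 & df['B'] < 3]
except ValueError as exc:
    raised = True
    print('unparenthesized version failed:', exc)

# Parenthesized: element-wise AND of two boolean Series
filtered = df[(df['A'] > 2) & (df['B'] < 3)]
print(filtered)
```

Only the row with A = 3 and B = 2 satisfies both conditions here.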
