This question was previously asked (and then deleted) by a user. I was looking for a solution so I could post an answer when the question disappeared. Moreover, I can't make sense of pandas' behaviour, so I would appreciate some clarity. The original question stated something along the lines of:
How can I replace every negative value except those in a given list with NaN in a Pandas dataframe?
My setup to reproduce the scenario is the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A' : [x for x in range(4)],
'B' : [x for x in range(-2, 2)]
})
This should technically only be a matter of correctly passing a boolean expression to DataFrame.where. My attempted solution looks like:
df[df >= 0 | df.isin([-2])]
which produces:
index | A | B |
---|---|---|
0 | 0 | NaN |
1 | 1 | NaN |
2 | 2 | 0 |
3 | 3 | 1 |
which also turns the value in the list (-2) into NaN!
Moreover, if I mask the dataframe with each of the two conditions separately, I get the correct behaviour:
With df[df >= 0]
(identical to the compound result)
index | A | B |
---|---|---|
0 | 0 | NaN |
1 | 1 | NaN |
2 | 2 | 0 |
3 | 3 | 1 |
With df[df.isin([-2])]
index | A | B |
---|---|---|
0 | NaN | -2.0 |
1 | NaN | NaN |
2 | NaN | NaN |
3 | NaN | NaN |
So it seems that either:
- I am running into some undefined behaviour as a result of performing logic on NaN values, or
- I have got something wrong.
Can anyone clarify this situation for me?
CodePudding user response:
Solution
df[(df >= 0) | (df.isin([-2]))]
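As a quick check, here is a minimal sketch of the fix applied to the frame from the question. The names keep, mask and result are mine, and spelling it with df.where instead of df[...] is just an equivalent alternative for a boolean DataFrame mask, not something the original post used.

```python
import numpy as np
import pandas as pd

# Same frame as in the question.
df = pd.DataFrame({
    'A': list(range(4)),
    'B': list(range(-2, 2)),
})

keep = [-2]  # negative values that should survive (taken from the question)

# Each condition is parenthesised, so | combines two boolean frames.
mask = (df >= 0) | (df.isin(keep))

# Cells where the mask is False become NaN; df[mask] does the same thing.
result = df.where(mask)
print(result)
```

With this mask, -2 in column B is kept, -1 becomes NaN, and the non-negative values are untouched.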
Explanation
In Python, bitwise OR (|) has a higher operator precedence than comparison operators like >=, so df >= 0 | df.isin([-2]) is evaluated as df >= (0 | df.isin([-2])): https://docs.python.org/3/reference/expressions.html#operator-precedence
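If you want to convince yourself of the grouping without involving pandas at all, the following sketch inspects how Python parses the unparenthesised expression and then repeats the effect with plain integers. The names df and mask inside the parsed string are only placeholders; nothing is evaluated there.

```python
import ast

# Parse the unparenthesised expression and look at how Python groups it.
tree = ast.parse("df >= 0 | mask", mode="eval").body

# The top-level node is the comparison; its right-hand side is the | expression.
is_compare = isinstance(tree, ast.Compare)
rhs = tree.comparators[0]
rhs_is_bitor = isinstance(rhs, ast.BinOp) and isinstance(rhs.op, ast.BitOr)
print(is_compare, rhs_is_bitor)  # True True

# The same effect with plain integers:
grouped_by_python = 1 >= 0 | 2   # parsed as 1 >= (0 | 2), i.e. 1 >= 2
grouped_by_hand = (1 >= 0) | 2   # True | 2, i.e. 1 | 2
print(grouped_by_python, grouped_by_hand)  # False 3
```

In the question there is only one comparison, so nothing raises; the expression simply compares df against the wrong right-hand side, which is why the result looked like df[df >= 0].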
When filtering a pandas DataFrame on multiple boolean conditions, you need to enclose each condition in parentheses. More from the boolean indexing section of the pandas user guide:
Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).
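To illustrate the example from the user guide, here is a small sketch on made-up data (the values in df below are my own, not from the question or the docs). Unlike the case in the question, the unparenthesised form here is a chained comparison, so Python ends up calling bool() on a Series and pandas raises a ValueError instead of silently returning the wrong rows.

```python
import pandas as pd

# Illustrative data only; not taken from the question.
df = pd.DataFrame({'A': [1, 3, 5], 'B': [0, 2, 4]})

# Without parentheses this is a chained comparison around (2 & df['B']),
# which needs the truth value of a Series and therefore fails.
try:
    df[df['A'] > 2 & df['B'] < 3]
    raised = False
except ValueError:
    raised = True

# With parentheses each comparison yields a boolean Series and & combines them.
filtered = df[(df['A'] > 2) & (df['B'] < 3)]
print(raised)
print(filtered)
```

Only the row with A = 3 and B = 2 satisfies both conditions, so that is the single row left in filtered.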