I have a DataFrame as follows:
import numpy as np
import pandas as pd

data = [[99330, 12, 122],
        [1123, 1230, 1287],
        [123, 101, 812739],
        [1143, 12301230, 252]]
df1 = pd.DataFrame(data, index=['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04'],
                   columns=['col_A', 'col_B', 'col_C'])
df1.index = pd.to_datetime(df1.index)
for col in df1.columns:
    df1[col + '_mean'] = df1[col].rolling(1).mean().shift()
    df1[col + '_std'] = df1[col].rolling(1).std().shift()
    df1[col + '_upper'] = df1[col + '_mean'] + df1[col + '_std']
    df1[col + '_lower'] = df1[col + '_mean'] - df1[col + '_std']
    df1[col + '_outlier'] = np.where(df1[col] > df1[col + '_upper'] or df1[col] < df1[col + '_lower'], 1, 0)
However, the last line raises an error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I want to get a column col + '_outlier' which displays 1 if df1[col] > df1[col + '_upper'] or if df1[col] < df1[col + '_lower'], and displays 0 otherwise.
What's the proper way to write this where clause with two conditions?
Answer:
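Have a look at the operator precedence table in the official Python documentation, which lists operators from highest precedence to lowest. The pipe | binds more tightly than the comparison operators > and <, so without parentheses the expression is not grouped the way you intend. In addition, or tries to reduce each Series to a single True/False value, which is what raises the ValueError, whereas | combines two boolean Series element-wise.
You need to wrap each condition in parentheses and use the pipe | instead of or: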
df1[col + '_outlier'] = np.where((df1[col] > df1[col + '_upper']) | (df1[col] < df1[col + '_lower']), 1, 0)
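To see the difference between or and | in isolation, here is a minimal sketch on made-up data. The DataFrame df and its columns value, lower and upper are hypothetical stand-ins for df1[col], df1[col + '_lower'] and df1[col + '_upper'], not the data from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (assumption): fixed bounds instead of rolling statistics.
df = pd.DataFrame({
    "value": [5, 15, 25, 10],
    "lower": [8, 8, 8, 8],
    "upper": [20, 20, 20, 20],
})

# `or` asks each Series for a single truth value, which pandas refuses to give.
try:
    (df["value"] > df["upper"]) or (df["value"] < df["lower"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# `|` combines the two boolean Series element-wise.
# The parentheses are required because | binds more tightly than > and <.
is_outlier = (df["value"] > df["upper"]) | (df["value"] < df["lower"])

# 1 where either condition holds, 0 otherwise.
df["outlier"] = np.where(is_outlier, 1, 0)

# Alternative without np.where: cast the boolean mask to integers.
df["outlier_alt"] = is_outlier.astype(int)

print(error_message)
print(df)
```

The same rule applies to "and" conditions: use & with parenthesized comparisons rather than the and keyword.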