I have the following DataFrame in Pandas:
import pandas as pd
import numpy as np
df = pd.DataFrame([(1, 1, 1, 0),
(2, 0, 0, 2),
(3, 0, 1, 3),
(4, 5, 3, 0)],
columns=list('abcd'))
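I need to apply the following function to that DataFrame: any value x greater than mean + 2*std should be replaced by mean + 2*std, and all other values should stay unchanged.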
I'm trying to use apply() for this:
dfs = df.apply(lambda x: np.mean(x) + 2*np.std(x) if x > np.mean(x) + 2*np.std(x) else x, axis=0, result_type='broadcast')
dfs
I'm getting the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I'm not really sure what it means, or where I should use a.empty, a.bool(), etc. to fix it.
CodePudding user response:
If you want to check row by row, you can use np.where instead of if/else in your program. Its first parameter is the condition. Where the condition is true, it takes the value of the second parameter at the same index; where it is false, it takes the value of the third parameter at the same index.
df.apply(lambda x: np.where(x > np.mean(x) + 2*np.std(x), np.mean(x) + 2*np.std(x), x), axis=0)
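To make the difference from if/else concrete: `x > bound` on a whole column yields a boolean Series with one entry per row, and `if` cannot reduce that to a single True/False, which is what the ValueError is complaining about. Below is a minimal sketch of the np.where approach, assuming a hypothetical helper `cap_upper` and a made-up ten-row frame `demo` (neither is from the question). It uses Series.std, which defaults to ddof=1, whereas np.std defaults to ddof=0, so the two give slightly different bounds.

```python
import numpy as np
import pandas as pd

def cap_upper(col: pd.Series) -> pd.Series:
    """Hypothetical helper: cap values above mean + 2*std at that bound."""
    bound = col.mean() + 2 * col.std()  # Series.std uses ddof=1 by default
    # np.where evaluates the condition element by element, so no single
    # True/False is ever required from the whole Series.
    capped = np.where(col > bound, bound, col)
    return pd.Series(capped, index=col.index)

# Made-up data: column "a" has one clear outlier, column "b" has none.
demo = pd.DataFrame({"a": [1] * 9 + [100], "b": range(10)})
capped = demo.apply(cap_upper, axis=0)
print(capped)
```

Returning a Series with the original index keeps the result aligned with the input frame, so apply() hands back a DataFrame of the same shape.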
CodePudding user response:
You can use clip after calculating the mean and std on the whole DataFrame at once.
df.clip(upper=df.mean() + 2*df.std(), axis=1)
With the current input it does not change anything. Here is a way to see what it does:
# calculate the current upper bound
_upper = df.mean() + 2*df.std()
print(_upper)
# a 5.081989
# b 6.260952
# c 3.766611
# d 4.250000
# dtype: float64
# then replace two values above the bound
df.loc[2,['a','b']] = [12,9]
print(df)
# a b c d
# 0 1 1 1 0
# 1 2 0 0 2
# 2 12 9 1 3 # see the values in column a and b
# 3 4 5 3 0
# see what clip does for the values in column a and b, index 2
print(df.clip(upper=_upper, axis=1))
# a b c d
# 0 1.000000 1.000000 1 0
# 1 2.000000 0.000000 0 2
# 2 5.081989 6.260952 1 3 # 12 and 9 replaced by the upper bound of the column
# 3 4.000000 5.000000 3 0
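A side note on why the original input is untouched: with only four rows, no value can lie more than 1.5 sample standard deviations above its column mean (the general limit is (n-1)/sqrt(n)), so nothing can exceed mean + 2*std unless the bound is computed before the outliers are inserted, as above. The sketch below instead uses a made-up ten-row frame `demo` (an assumption, not data from the question) so that clip acts on bounds computed from the same data. It also compares the ddof=1 bound from DataFrame.std with the ddof=0 bound that np.std produces by default.

```python
import numpy as np
import pandas as pd

# Made-up data: column "a" has one clear outlier, column "b" has none.
demo = pd.DataFrame({"a": [1] * 9 + [100], "b": range(10)})

# One upper bound per column; DataFrame.std uses ddof=1 by default.
upper = demo.mean() + 2 * demo.std()

# axis=1 aligns the bounds with the column labels.
clipped = demo.clip(upper=upper, axis=1)

# The same bound computed with NumPy's default ddof=0 is slightly lower.
upper_np = demo.apply(lambda col: np.mean(col.to_numpy()) + 2 * np.std(col.to_numpy()))

print(upper)
print(upper_np)
print(clipped)
```

Whichever definition of the standard deviation is intended, it is probably worth passing ddof explicitly so that the NumPy and pandas versions agree.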