Lambda function inside DataFrame Apply

Time:11-11

I have the following DataFrame in Pandas:

import pandas as pd
import numpy as np

df = pd.DataFrame([(1, 1, 1, 0),
                   (2, 0, 0, 2),
                   (3, 0, 1, 3),
                   (4, 5, 3, 0)],
                  columns=list('abcd'))

I need to apply the following function to each column of that DataFrame, where mean and std are computed per column:

f(x) = mean + 2*std   if x > mean + 2*std
f(x) = x              otherwise

I'm trying to use the apply() function below:

dfs = df.apply(lambda x: np.mean(x) + 2*np.std(x) if x > np.mean(x) + 2*np.std(x) else x, axis=0, result_type='broadcast')
dfs

I'm getting the following error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I'm not really sure what it means, or where I should use a.empty, a.bool(), etc. to fix it.

CodePudding user response:

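With axis=0, the lambda receives a whole column (a Series), so x > ... is a Series of booleans and a plain if/else cannot reduce it to a single True or False. To check element by element, use np.where instead of if/else. The first parameter is the condition. Where it is true, np.where takes the value of the second parameter at the same index; where it is false, it takes the value of the third parameter at the same index.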

df.apply(lambda x: np.where(x > np.mean(x) + 2*np.std(x), np.mean(x) + 2*np.std(x), x), axis=0)
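
As a rough illustration of the difference between if/else and np.where, here is a minimal sketch on a single column. The Series s is made-up data (not from the question), chosen so that one value actually exceeds the bound; the names s, bound, mask and capped are only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data (not from the question): nine 1s and one outlier.
s = pd.Series([1] * 9 + [100])
bound = s.mean() + 2 * s.std()   # pandas .std() uses ddof=1 by default

mask = s > bound                 # boolean Series, one True/False per element

try:
    bool(mask)                   # this is what `if mask:` would attempt
    ambiguous = False
except ValueError:
    ambiguous = True             # pandas refuses to collapse it to one value

# np.where chooses element by element: bound where mask is True, else s.
capped = pd.Series(np.where(mask, bound, s), index=s.index)
print(capped)
```

Only the last element is replaced by the bound; the other nine stay at 1. The same element-wise selection is what the np.where call inside apply does for each column.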

CodePudding user response:

You can use clip after calculating the mean and std of every column of the DataFrame at once.

df.clip(upper=df.mean() + 2*df.std(), axis=1)

With the current input it does not change anything, because no value exceeds its column's bound. Here is a way to see what it does:

# calculate the current upper bound
_upper = df.mean() + 2*df.std()
print(_upper)
# a    5.081989
# b    6.260952
# c    3.766611
# d    4.250000
# dtype: float64

# then replace two values above the bound
df.loc[2,['a','b']] = [12,9]
print(df)
#     a  b  c  d
# 0   1  1  1  0
# 1   2  0  0  2
# 2  12  9  1  3 # see the values in column a and b
# 3   4  5  3  0

# see what clip does for the values in column a and b, index 2
print(df.clip(upper=_upper, axis=1))
#           a         b  c  d
# 0  1.000000  1.000000  1  0
# 1  2.000000  0.000000  0  2
# 2  5.081989  6.260952  1  3  # 12 and 9 replaced by the upper bound of the column
# 3  4.000000  5.000000  3  0
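
The clip approach above could be wrapped in a small function, sketched below under some assumptions: clip_upper_2std and the demo frame are hypothetical names and data, not part of pandas or the question. The ddof parameter is exposed because pandas' .std() defaults to ddof=1 while NumPy's np.std defaults to ddof=0, so the two answers may give slightly different bounds.

```python
import numpy as np
import pandas as pd

def clip_upper_2std(frame, ddof=1):
    # Hypothetical helper: cap each column at its own mean + 2*std.
    upper = frame.mean() + 2 * frame.std(ddof=ddof)
    # axis=1 aligns the index of `upper` (column names) with the columns.
    return frame.clip(upper=upper, axis=1)

# Hypothetical data: each column has nine small values and one outlier.
demo = pd.DataFrame({'a': [1.0] * 9 + [100.0],
                     'b': [2.0] * 9 + [50.0]})

clipped = clip_upper_2std(demo)
print(clipped.tail(2))
```

Note that the bound here is computed from the same data that gets clipped. With only four rows, as in the question's DataFrame, no value can lie more than 1.5 sample standard deviations above its column mean, which is why the example above computes _upper before inserting the larger values.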