Pandas: Filter dataframe using if statement and variables inside function-CodePudding

Say I have a dataframe (df_1) with "high" column limits defined in terms of percentiles like so:

df_1:
A   B   C
72  57  61
3   52  32
64  51  93
71  79  91
25  23  31
91  90  1
43  44  84
98  91  8
11  49  38
65  33  59

# High Percentiles defined for each column as a variable

A_High = np.percentile(df_1['A'], 60)
B_High = np.percentile(df_1['B'], 60)
C_High = np.percentile(df_1['C'], 60)

#The output for the above variables:
# A_High = 67.39999999999999
# B_High = 54.0
# C_High =59.8

I am trying to create a function that uses the percentile variables above and an IF statement with multiple conditions to filter results of the dataframe. The filtered results are then compiled as 1 or 0 in a separate column called Alerts. I've been able to perform this successfully using the following code:

def Alert(df):
    if df['A'] >= 67.39999999999999 and df['B'] >= 54.0 and df['C'] >= 59.8:
        return 1
    else:
        return 0

df_1.insert(3, 'Alert', df_1.apply(Alert, axis=1))

OUTPUT:
A   B   C   Alert
72  57  61  1
3   52  32  0
64  51  93  0
71  79  91  1
25  23  31  0
91  90  1   0
43  44  84  0
98  91  8   0
11  49  38  0
65  33  59  0

But when I define the percentiles within the function itself then it does not generate the desired output.

def Alert(df):
    A_High = np.percentile(df['A'], 60)
    B_High = np.percentile(df['B'], 60)
    C_High = np.percentile(df['C'], 60)
    if df['A'] >= A_High and df['B'] >= B_High and df['C'] >= C_High:
        return 1
    else:
        return 0

df_1.insert(3, 'Alert', df_1.apply(Alert, axis=1))

OUTPUT:
A   B   C   Alert
72  57  61  1
3   52  32  1
64  51  93  1
71  79  91  1
25  23  31  1
91  90  1   1
43  44  84  1
98  91  8   1
11  49  38  1
65  33  59  1

I want to run this function over different dataframes and don't want to manually enter the high limit values every time. Would appreciate if someone could help me out here (FYI: New to Python)

CodePudding user response：

Pandas apply function iterates over the rows of dataframe so the calculated values for A_High B_High and C_High will change for each iteration you can create different function which returns those values and pass the values to the Alert function

Example:

def percentiles(df):
    A_High = np.percentile(df['A'], 60)
    B_High = np.percentile(df['B'], 60)
    C_High = np.percentile(df['C'], 60)
    return A_High,B_High,C_High
def Alert(df,A_High,B_High,C_High):
    if df['A'] >= A_High and df['B'] >= B_High and df['C'] >= C_High:
        return 1
    else:
        return 0
A_High,B_High,C_High=percentiles(df_1)
df_1.insert(3, 'Alert', df_1.apply(Alert, axis=1,args=A_High,B_High,C_High))