I have a dataframe with multiple column. I want to have a new column based on some threshold for a s-CodePudding

I am having a similar dataframe like below.

    group_name  field_A
0   g1  3
1   g1  49
2   g1  90
3   g2  78
4   g2  92
5   g2  76
6   g2  48
7   g3  33
8   g3  44
9   g3  55

I want to process above dataframe and check group wise values and if the value of field_A is above 75%tile of the all group values then add 'Yes' in a new column which is 'isAboveThreshold'. so it will look like

group_name  field_A top isTop75
0   g1  3   69.5    No
1   g1  49  69.5    No
2   g1  90  69.5    Yes
3   g2  78  81.5    No
4   g2  92  81.5    Yes
5   g2  76  81.5    No
6   g2  48  81.5    No
7   g3  33  49.5    No
8   g3  44  49.5    No
9   g3  55  49.5    Yes

CodePudding user response：

Use GroupBy.transform with lambda function and Series.quantile, compare by Series.ge and pass to numpy.where:

s = df.groupby('group_name')['field_A'].transform(lambda x: x.quantile(.75))
df['isTop75'] = np.where(df['field_A'].ge(s), 'Yes', 'No')
print(df)
  group_name  field_A isTop75
0         g1        3      No
1         g1       49      No
2         g1       90     Yes
3         g2       78      No
4         g2       92     Yes
5         g2       76      No
6         g2       48      No
7         g3       33      No
8         g3       44      No
9         g3       55     Yes

CodePudding user response：

Here is a useful function for that, which can be passed to transform or agg after groupby:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group_name' : ['g1', 'g1', 'g1', 'g2', 'g2', 'g2', 'g2', 'g3', 'g3', 'g3'],
    'field_A': [3, 49, 90, 78, 92, 76, 48, 33, 44, 55]})

def percentile(n):
    """Percentile function that can be passed to pandas agg() function
    AND changes column name.
    """
    def percentile_(x):
        return x.quantile(n)
    percentile_.__name__ = 'q%s' % n
    return percentile_

df['top'] = df.groupby('group_name').transform(percentile(.75))
df['isTop75'] = np.where(df.field_A >= df.top, 'Yes', 'No')
print(df)

output:

  group_name  field_A   top isTop75
0         g1        3  69.5      No
1         g1       49  69.5      No
2         g1       90  69.5     Yes
3         g2       78  81.5      No
4         g2       92  81.5     Yes
5         g2       76  81.5      No
6         g2       48  81.5      No
7         g3       33  49.5      No
8         g3       44  49.5      No
9         g3       55  49.5     Yes

CodePudding user response：

There is a quantile method in pandas. Apply it after grouping your df by group_name:

new_df = pd.DataFrame()

for _, group in df.groupby("group_name"):
    group["top"] = group["field_A"].quantile(q=0.75)
    group["isTop75"] = group.apply(lambda row: "Yes" if row["field_A"] > row["top"] else "No", axis=1)
    new_df = pd.concat([new_df, group])

print(new_df)

Output:

  group_name  field_A   top isTop75
0         g1        3  69.5      No
1         g1       49  69.5      No
2         g1       90  69.5     Yes
3         g2       78  81.5      No
4         g2       92  81.5     Yes
5         g2       76  81.5      No
6         g2       48  81.5      No
7         g3       33  49.5      No
8         g3       44  49.5      No
9         g3       55  49.5     Yes

Edit: You can get get rid of the loop by using transform:

df["top"] = df.groupby("group_name")["field_A"].transform(lambda x: x.quantile(.75))
df["isTop75"] = df.apply(lambda row: "Yes" if row["field_A"] > row["top"] else "No", axis=1)
print(df)