Select rows of Pandas DataFrame in different groups by comparing columns


Consider the following dataframe:


    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({"A":[1,2,3], "B":['a','b','c']})
    df = pd.concat([df]*5, ignore_index=False).sort_index(ascending=True).reset_index(drop=True)
    df["C"] = np.arange(1,16)
    df["D"] = np.random.uniform(-1., 1., 15) 
    df["E"] = np.random.uniform(0, 1., 15)

        A  B   C         D         E
    0   1  a   1 -0.763898  0.912174
    1   1  a   2 -0.293539  0.595689
    2   1  a   3 -0.471699  0.361923
    3   1  a   4  0.211997  0.231958
    4   1  a   5  0.017619  0.061430
    5   2  b   6 -0.537799  0.033919
    6   2  b   7  0.699780  0.162702
    7   2  b   8  0.117749  0.035086
    8   2  b   9 -0.844986  0.028885
    9   2  b  10 -0.195802  0.791704
    10  3  c  11  0.347316  0.297026
    11  3  c  12 -0.003912  0.219278
    12  3  c  13 -0.364595  0.531133
    13  3  c  14 -0.399493  0.898012
    14  3  c  15  0.191196  0.526866

The rows fall into 3 groups with respect to columns A and B. I also have a vector of size 3, with one value per group in df:

    arr = np.array([-0.5, 0.7, 0.])

Now, for each group of rows, if some rows have D values greater than the group's arr value, I want to select the row with the largest E value among that subset. If no row in the group has a D value greater than the arr value, then I simply select the row with the largest E value in the whole group. For the given df, the result I need is:

   A  B   C         D         E 
0  1  a   2 -0.293539  0.595689 
1  2  b  10 -0.195802  0.791704
2  3  c  15  0.191196  0.526866
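For reference, the selection rule can be written down directly as a per-group loop. This is only a sketch using the example values printed above; it assumes the order of `arr` matches the sorted group order that `groupby` iterates in:

```python
import pandas as pd
import numpy as np

# Deterministic copy of the example data shown above
df = pd.DataFrame({
    "A": np.repeat([1, 2, 3], 5),
    "B": np.repeat(["a", "b", "c"], 5),
    "C": np.arange(1, 16),
    "D": [-0.763898, -0.293539, -0.471699, 0.211997, 0.017619,
          -0.537799, 0.699780, 0.117749, -0.844986, -0.195802,
          0.347316, -0.003912, -0.364595, -0.399493, 0.191196],
    "E": [0.912174, 0.595689, 0.361923, 0.231958, 0.061430,
          0.033919, 0.162702, 0.035086, 0.028885, 0.791704,
          0.297026, 0.219278, 0.531133, 0.898012, 0.526866],
})
arr = np.array([-0.5, 0.7, 0.0])

rows = []
for (_, group), thr in zip(df.groupby(["A", "B"]), arr):
    candidates = group[group["D"] > thr]          # rows beating the threshold
    pool = candidates if not candidates.empty else group
    rows.append(pool.nlargest(1, "E"))            # max-E row of the pool
result = pd.concat(rows, ignore_index=True)
print(result)                                     # C column: 2, 10, 15
```

`nlargest(1, "E")` keeps the whole max-E row, so no extra lookup is needed after the selection.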

The code that I wrote gives the result only for the groups in which at least one row has a D value larger than the corresponding arr value.

    df["aux_col"] = np.repeat(arr, 5)

    # kind="stable" keeps the E-descending order within each (A, B) group
    df_tmp = df.sort_values("E", ascending=False).sort_values(["A", "B"], kind="stable")
    df_tmp[df_tmp["D"]>df_tmp["aux_col"]].groupby(["A", "B"]).first().reset_index().drop("aux_col", axis=1)

Output:

   A  B   C         D         E
0  1  a   2 -0.293539  0.595689
1  3  c  15  0.191196  0.526866

I need a way to also include the groups for which none of the rows has D value larger than the arr value.

CodePudding user response:

import pandas as pd
import numpy as np

df = pd.DataFrame({"A":[1,2,3], "B":['a','b','c']})
df = pd.concat([df]*5, ignore_index=False).sort_index(ascending=True).reset_index(drop=True)
df["C"] = np.arange(1,16)
df["D"] = np.random.uniform(-1., 1., 15) 
df["E"] = np.random.uniform(0, 1., 15)

arr = np.array([-0.5, 0.7, 0.])

# Start from an empty frame with the same columns
df_output = df.iloc[:0, :].copy()

for i in range(len(arr)):
    df_i = df[df['A'] == i + 1]                  # rows of group i
    df_greater = df_i[df_i['D'] > arr[i]]        # rows beating the threshold
    if df_greater.empty:
        df_tmp = df_i.sort_values('E', ascending=False)
    else:
        df_tmp = df_greater.sort_values('E', ascending=False)
    df_output = pd.concat([df_output, df_tmp.iloc[[0]]])  # keep the max-E row
    

CodePudding user response:

I found this solution:

df["aux_col"] = np.repeat(arr, 5)

# kind="stable" keeps the E-descending order within each (A, B) group
df_tmp = df.sort_values("E", ascending=False).sort_values(["A", "B"], kind="stable")
df_tmp["compare"] = df_tmp["D"] > df_tmp["aux_col"]
# first(): max-E row within each (A, B, compare) subgroup; last() then
# prefers the compare=True row (which sorts after False) when it exists
df_tmp.groupby(["A", "B", "compare"]).first().reset_index().\
      groupby(["A", "B"]).last().reset_index().drop(["aux_col", "compare"], axis=1)
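The double groupby can also be collapsed into a single sort: a sketch of an equivalent vectorized variant, relying on boolean `False` sorting before `True` and on `groupby` preserving the frame's current row order, so `tail(1)` keeps the preferred row of each group:

```python
import pandas as pd
import numpy as np

# Same example data as in the question, with fixed values for reproducibility
df = pd.DataFrame({
    "A": np.repeat([1, 2, 3], 5),
    "B": np.repeat(["a", "b", "c"], 5),
    "C": np.arange(1, 16),
    "D": [-0.763898, -0.293539, -0.471699, 0.211997, 0.017619,
          -0.537799, 0.699780, 0.117749, -0.844986, -0.195802,
          0.347316, -0.003912, -0.364595, -0.399493, 0.191196],
    "E": [0.912174, 0.595689, 0.361923, 0.231958, 0.061430,
          0.033919, 0.162702, 0.035086, 0.028885, 0.791704,
          0.297026, 0.219278, 0.531133, 0.898012, 0.526866],
})
arr = np.array([-0.5, 0.7, 0.0])

df["aux_col"] = np.repeat(arr, 5)
df["compare"] = df["D"] > df["aux_col"]

# Ascending sort puts compare=True rows (if any) at the end of each group,
# ordered by E; tail(1) then keeps the preferred row of each group.
result = (df.sort_values(["compare", "E"])
            .groupby(["A", "B"]).tail(1)
            .sort_values(["A", "B"])
            .drop(columns=["aux_col", "compare"])
            .reset_index(drop=True))
print(result)                                    # C column: 2, 10, 15
```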