Consider the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":[1,2,3], "B":['a','b','c']})
df = pd.concat([df]*5, ignore_index=False).sort_index(ascending=True).reset_index(drop=True)
df["C"] = np.arange(1,16)
df["D"] = np.random.uniform(-1., 1., 15)
df["E"] = np.random.uniform(0, 1., 15)
A B C D E
0 1 a 1 -0.763898 0.912174
1 1 a 2 -0.293539 0.595689
2 1 a 3 -0.471699 0.361923
3 1 a 4 0.211997 0.231958
4 1 a 5 0.017619 0.061430
5 2 b 6 -0.537799 0.033919
6 2 b 7 0.699780 0.162702
7 2 b 8 0.117749 0.035086
8 2 b 9 -0.844986 0.028885
9 2 b 10 -0.195802 0.791704
10 3 c 11 0.347316 0.297026
11 3 c 12 -0.003912 0.219278
12 3 c 13 -0.364595 0.531133
13 3 c 14 -0.399493 0.898012
14 3 c 15 0.191196 0.526866
The rows are in 3 groups with respect to columns A and B. I also have a vector of size 3, one value per group in df.
arr = np.array([-0.5, 0.7, 0.])
Now, for each group of rows in the dataframe, if there are rows whose D values are greater than the corresponding arr value, I want to select the row with the largest E value among this subset of rows. If there are no rows whose D values are greater than the corresponding arr value, then I simply select the row with the largest E value in the group. For the given df, what I need to have is:
A B C D E
0 1 a 2 -0.293539 0.595689
1 2 b 10 -0.195802 0.791704
2 3 c 15 0.191196 0.526866
The code that I wrote gives the result for the groups in which some rows have a D value larger than the corresponding arr value.
df["aux_col"] = np.repeat(arr, 5)
df_tmp = df.sort_values("E", ascending=False).sort_values(["A", "B"])
df_tmp[df_tmp["D"]>df_tmp["aux_col"]].groupby(["A", "B"]).first().reset_index().drop("aux_col", axis=1)
Output:
   A  B   C         D         E
0  1  a   2 -0.293539  0.595689
1  3  c  15  0.191196  0.526866
I need a way to also include the groups in which none of the rows has a D value larger than the arr value.
CodePudding user response:
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":[1,2,3], "B":['a','b','c']})
df = pd.concat([df]*5, ignore_index=False).sort_index(ascending=True).reset_index(drop=True)
df["C"] = np.arange(1,16)
df["D"] = np.random.uniform(-1., 1., 15)
df["E"] = np.random.uniform(0, 1., 15)
arr = np.array([-0.5, 0.7, 0.])
df_output = df.iloc[:0, :].copy()  # empty frame with the same columns
for i in range(len(arr)):
    df_i = df[df['A'] == i + 1]            # rows of group i
    df_greater = df_i[df_i['D'] > arr[i]]  # rows beating the group's threshold
    if df_greater.empty:
        df_tmp = df_i.sort_values('E', ascending=False)
    else:
        df_tmp = df_greater.sort_values('E', ascending=False)
    df_output = pd.concat([df_output, df_tmp.iloc[[0]]])  # keep the top-E row
CodePudding user response:
I found this solution:
df["aux_col"] = np.repeat(arr, 5)
# kind="stable" keeps the descending-E order within each group
df_tmp = df.sort_values("E", ascending=False).sort_values(["A", "B"], kind="stable")
df_tmp["compare"] = df_tmp["D"] > df_tmp["aux_col"]
# First pick the top-E row per (A, B, compare) combination; then, since the
# groups come out sorted with compare=False before compare=True, .last()
# prefers the compare=True row when one exists.
df_tmp.groupby(["A", "B", "compare"]).first().reset_index().\
    groupby(["A", "B"]).last().reset_index().drop(["aux_col", "compare"], axis=1)
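A single sort can also do the job: sort on the boolean flag and on E together, so that within each group the rows beating the threshold come first in descending-E order, then take the first row per group. This is a sketch, not the answer above; the `compare` column name is just illustrative, and the D/E values are fixed to the ones shown in the question so the result can be checked against the expected output.

```python
import numpy as np
import pandas as pd

# Same frame as in the question, with D and E fixed to the values shown.
df = pd.DataFrame({"A": [1, 2, 3], "B": ['a', 'b', 'c']})
df = pd.concat([df] * 5).sort_index().reset_index(drop=True)
df["C"] = np.arange(1, 16)
df["D"] = [-0.763898, -0.293539, -0.471699, 0.211997, 0.017619,
           -0.537799, 0.699780, 0.117749, -0.844986, -0.195802,
           0.347316, -0.003912, -0.364595, -0.399493, 0.191196]
df["E"] = [0.912174, 0.595689, 0.361923, 0.231958, 0.061430,
           0.033919, 0.162702, 0.035086, 0.028885, 0.791704,
           0.297026, 0.219278, 0.531133, 0.898012, 0.526866]
arr = np.array([-0.5, 0.7, 0.])

# Flag rows that beat their group's threshold, sort so flagged rows come
# first (compare descending) ordered by descending E, and keep the top
# row per group; groupby preserves the within-group row order.
df["compare"] = df["D"] > np.repeat(arr, 5)
result = (df.sort_values(["compare", "E"], ascending=False)
            .groupby(["A", "B"], as_index=False)
            .first()
            .drop(columns="compare"))
print(result["C"].tolist())  # [2, 10, 15]
```

Because a missing compare=True row simply means the compare=False rows move to the front of the group, the "no row beats the threshold" case falls out of the same sort with no special handling.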