Pandas: How to select the entry with the lowest mean per group?-CodePudding

Suppose I have the following dataframe

df = pd.DataFrame([
    (2, 2, 'A', .5),
    (2, 2, 'A', .6),
    (2, 2, 'B', .75),
    (2, 2, 'B', .7),
    (2, 2, 'C', .6),

    (2, 3, 'A', .65),
    (2, 3, 'A', .6),
    (2, 3, 'B', .75),
    (2, 3, 'B', .7),
    (2, 3, 'C', .6)
], columns=['out_size', 'problem_size', 'algo', 'time'])

I want to

group by `[out_size', 'problem_size', 'algo'], and for each group
~~count the number of occurrences for each algo, then~~
select/keep the algo that has the lowest average time in that group,

result

pd.DataFrame(
      [[2, 2, 'A', 0.55],
       [2, 3, 'C', 0.6]], columns=['out_size', 'problem_size', 'algo', 'time'])

CodePudding user response：

You can use a double groupby:

cols = ['out_size', 'problem_size', 'algo']

out = (df
 .groupby(cols, as_index=False)['time'].mean()
 .sort_values(by='time')
 .groupby(cols[:-1], as_index=False).first()
)

Slightly more efficient alternative that doesn't require to sort the values (but requires to store an intermediate):

cols = ['out_size', 'problem_size', 'algo']

out = df.groupby(cols)['time'].mean()
out = out.loc[out.groupby(cols[:-1]).idxmin()].reset_index()

output:

   out_size  problem_size algo  time
0         2             2    A  0.55
1         2             3    C  0.60

CodePudding user response：

You can do this with two groupbys:

group_cols = ["out_size", "problem_size", "algo"]
result = df.groupby(group_cols)["time"].mean().reset_index(drop=False)

computes the mean time for each group. Then you can sort the time in ascending order:

result = result.sort_values("time", ascending=True)

and then group again (without column algo) and take the first element of each group:

result = result.groupby(["out_size", "problem_size"])[["algo","time"]].first().reset_index(drop=False)

CodePudding user response：

You can do - first groupby the three columns and then agg by mean and then sort ascending and then you want to take only the first in the groups, so drop duplicates based on out_size and problem_size and keeping only the first which will guarantee you get the lowest average algo because it was already sorted.

df.groupby(['out_size', 'problem_size', 'algo'],as_index=False).agg(
    {'time':'mean'}).sort_values(
    by='time',ignore_index=True).drop_duplicates(
    subset=['out_size', 'problem_size'], keep='first', ignore_index=True)

Or if you have more columns that all need average agg, then

agg_cols = ['time', #someother columns]
groupby_cols = df.columns.drop(agg_cols).tolist()
g = df.groupby(groupby_cols,as_index=False).agg('mean')
g.sort_values(by='time',ignore_index=True).drop_duplicates(
               subset=['out_size','problem_size'],ignore_index=True, keep='first')

output:

   out_size  problem_size algo  time
0         2             2    A  0.55
1         2             3    C  0.60

CodePudding user response：

You can do in one line by - grouping values, sorting & grouping again:

df1 = df.groupby(['out_size', 'problem_size', 'algo'])['time'].mean().reset_index()
df1 = df1.sort_values(['out_size', 'problem_size', 'time'])
df1 = df1.groupby(['out_size', 'problem_size']).head(1)
df1

Output:

   out_size  problem_size algo  time
      2             2      A     0.55
      2             3      C     0.60

CodePudding user response：

cols = ['out_size', 'problem_size', 'algo']
out = df.groupby(cols)['time'].mean().unstack('algo').agg(['idxmin', 'min'], axis=1).reset_index()
out.rename(columns={'idxmin':'algo', 'min':'time'}, inplace=True)
print(out)

Output:

   out_size  problem_size algo  time
0         2             2    A  0.55
1         2             3    C   0.6