Suppose I have the following DataFrame:
df = pd.DataFrame([
(2, 2, 'A', .5),
(2, 2, 'A', .6),
(2, 2, 'B', .75),
(2, 2, 'B', .7),
(2, 2, 'C', .6),
(2, 3, 'A', .65),
(2, 3, 'A', .6),
(2, 3, 'B', .75),
(2, 3, 'B', .7),
(2, 3, 'C', .6)
], columns=['out_size', 'problem_size', 'algo', 'time'])
I want to:
- group by `['out_size', 'problem_size', 'algo']` and compute the average `time` for each group,
- then select/keep the `algo` that has the lowest average `time` within each `(out_size, problem_size)` pair.
Expected result:
pd.DataFrame(
    [[2, 2, 'A', 0.55],
     [2, 3, 'C', 0.6]], columns=['out_size', 'problem_size', 'algo', 'time'])
CodePudding user response:
You can use a double groupby:
cols = ['out_size', 'problem_size', 'algo']
out = (df
       .groupby(cols, as_index=False)['time'].mean()
       .sort_values(by='time')
       .groupby(cols[:-1], as_index=False).first()
       )
A slightly more efficient alternative that doesn't require sorting the values (but does require storing an intermediate):
cols = ['out_size', 'problem_size', 'algo']
out = df.groupby(cols)['time'].mean()
out = out.loc[out.groupby(cols[:-1]).idxmin()].reset_index()
output:
out_size problem_size algo time
0 2 2 A 0.55
1 2 3 C 0.60
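As a quick end-to-end check, here is the `idxmin` variant run against the sample frame from the question, made self-contained (nothing new, just the answer's code with the setup inlined):

```python
import pandas as pd

df = pd.DataFrame([
    (2, 2, 'A', .5), (2, 2, 'A', .6), (2, 2, 'B', .75), (2, 2, 'B', .7),
    (2, 2, 'C', .6), (2, 3, 'A', .65), (2, 3, 'A', .6), (2, 3, 'B', .75),
    (2, 3, 'B', .7), (2, 3, 'C', .6),
], columns=['out_size', 'problem_size', 'algo', 'time'])

cols = ['out_size', 'problem_size', 'algo']
# mean time per (out_size, problem_size, algo); the result is a MultiIndex Series
out = df.groupby(cols)['time'].mean()
# idxmin per (out_size, problem_size) returns the full index labels of the fastest algo,
# which are then used to select the winning rows
out = out.loc[out.groupby(cols[:-1]).idxmin()].reset_index()
print(out)
```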
CodePudding user response:
You can do this with two groupbys:
group_cols = ["out_size", "problem_size", "algo"]
result = df.groupby(group_cols)["time"].mean().reset_index(drop=False)
This computes the mean time for each group. Then you can sort the time in ascending order:
result = result.sort_values("time", ascending=True)
and then group again (without the `algo` column) and take the first element of each group:
result = result.groupby(["out_size", "problem_size"])[["algo","time"]].first().reset_index(drop=False)
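Put together and run against the sample frame from the question, the three steps look like this (a self-contained sketch, assuming pandas imported as `pd`):

```python
import pandas as pd

df = pd.DataFrame([
    (2, 2, 'A', .5), (2, 2, 'A', .6), (2, 2, 'B', .75), (2, 2, 'B', .7),
    (2, 2, 'C', .6), (2, 3, 'A', .65), (2, 3, 'A', .6), (2, 3, 'B', .75),
    (2, 3, 'B', .7), (2, 3, 'C', .6),
], columns=['out_size', 'problem_size', 'algo', 'time'])

group_cols = ["out_size", "problem_size", "algo"]
# step 1: mean time per (out_size, problem_size, algo)
result = df.groupby(group_cols)["time"].mean().reset_index()
# step 2: sort so the fastest algo comes first within every group
result = result.sort_values("time", ascending=True)
# step 3: regroup without 'algo' and keep the first (fastest) row of each group
result = (result.groupby(["out_size", "problem_size"])[["algo", "time"]]
                .first().reset_index())
print(result)
```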
CodePudding user response:
You can first group by the three columns and aggregate `time` by mean, then sort ascending, and finally keep only the first row per group: drop duplicates based on `out_size` and `problem_size`, keeping the first occurrence. Since the frame is already sorted by `time`, this guarantees you keep the `algo` with the lowest average.
(df.groupby(['out_size', 'problem_size', 'algo'], as_index=False)
   .agg({'time': 'mean'})
   .sort_values(by='time', ignore_index=True)
   .drop_duplicates(subset=['out_size', 'problem_size'], keep='first', ignore_index=True))
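For reference, the chain above reproduces the expected frame when run on the sample data (self-contained version):

```python
import pandas as pd

df = pd.DataFrame([
    (2, 2, 'A', .5), (2, 2, 'A', .6), (2, 2, 'B', .75), (2, 2, 'B', .7),
    (2, 2, 'C', .6), (2, 3, 'A', .65), (2, 3, 'A', .6), (2, 3, 'B', .75),
    (2, 3, 'B', .7), (2, 3, 'C', .6),
], columns=['out_size', 'problem_size', 'algo', 'time'])

# mean per full group, sort ascending, keep first row per (out_size, problem_size)
out = (df.groupby(['out_size', 'problem_size', 'algo'], as_index=False)
         .agg({'time': 'mean'})
         .sort_values(by='time', ignore_index=True)
         .drop_duplicates(subset=['out_size', 'problem_size'],
                          keep='first', ignore_index=True))
print(out)
```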
Or, if you have more columns that all need a mean aggregation:
agg_cols = ['time']  # plus any other columns to aggregate
groupby_cols = df.columns.drop(agg_cols).tolist()
g = df.groupby(groupby_cols, as_index=False).agg('mean')
g.sort_values(by='time', ignore_index=True).drop_duplicates(
    subset=['out_size', 'problem_size'], keep='first', ignore_index=True)
output:
out_size problem_size algo time
0 2 2 A 0.55
1 2 3 C 0.60
CodePudding user response:
You can do this by grouping the values, sorting, and grouping again:
df1 = df.groupby(['out_size', 'problem_size', 'algo'])['time'].mean().reset_index()
df1 = df1.sort_values(['out_size', 'problem_size', 'time'])
df1 = df1.groupby(['out_size', 'problem_size']).head(1)
df1
Output:
out_size problem_size algo time
2 2 A 0.55
2 3 C 0.60
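Note that `head(1)` keeps the original row labels from `df1` rather than renumbering them. A self-contained check on the sample data (assuming pandas imported as `pd`):

```python
import pandas as pd

df = pd.DataFrame([
    (2, 2, 'A', .5), (2, 2, 'A', .6), (2, 2, 'B', .75), (2, 2, 'B', .7),
    (2, 2, 'C', .6), (2, 3, 'A', .65), (2, 3, 'A', .6), (2, 3, 'B', .75),
    (2, 3, 'B', .7), (2, 3, 'C', .6),
], columns=['out_size', 'problem_size', 'algo', 'time'])

df1 = df.groupby(['out_size', 'problem_size', 'algo'])['time'].mean().reset_index()
df1 = df1.sort_values(['out_size', 'problem_size', 'time'])
# head(1) keeps the fastest (first) row of each (out_size, problem_size) group
best = df1.groupby(['out_size', 'problem_size']).head(1)
print(best)
```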
CodePudding user response:
Another option is to unstack `algo` into columns of mean times, then take the row-wise minimum and its column label:
cols = ['out_size', 'problem_size', 'algo']
out = (df.groupby(cols)['time'].mean()
         .unstack('algo')
         .agg(['idxmin', 'min'], axis=1)
         .reset_index()
         .rename(columns={'idxmin': 'algo', 'min': 'time'}))
print(out)
Output:
out_size problem_size algo time
0 2 2 A 0.55
1 2 3 C 0.6