I'd like to group by two columns and then print out rows corresponding to minimum value on the-CodePudding

Here is an example of two dataframes in pandas:

 df = pd.DataFrame({0: [1.5, 2, 2.3, 4], 1: [3, 6, 1, 0]})
 df2 = pd.DataFrame({0: [1.7, 4.05, 2.1, 2.99], 1: [1, 3, 1, 7]})

df.columns = ["x1", "y1"]
df2.columns = ["x2", "y2"]

Then I merge them and make another dataframe which consists only of x1 and x2 columns that are closest to each other.

merged1 = df.merge(df2, how='cross')
merged1['diff'] = (merged1['x1'].sub(merged1['x2'])).abs()

out1 = (merged1.loc[merged1.groupby(df.columns.tolist())['diff'].idxmin().to_numpy()])

And lastly, I'd like to group by column y2 and print out the row of each group that has the smallest value of the column diff

out1.groupby(["y2"])['diff'].min()

But it prints out all the rows:

    x1  y1    x2  y2  diff
0   1.5   3  1.70   1  0.20
6   2.0   6  2.10   1  0.10
10  2.3   1  2.10   1  0.20
13  4.0   0  4.05   3  0.05

What's the problem?

CodePudding user response：

You need to use:

out1.loc[out1.groupby(["y2"])['diff'].idxmin()]

output:

     x1  y1    x2  y2  diff
6   2.0   6  2.10   1  0.10
13  4.0   0  4.05   3  0.05

Note that you could do the whole thing with just 2 commands (using pandas.merge_asof):

(pd
 .merge_asof(df, df2.sort_values(by='x2'),
             left_on='x1', right_on='x2', direction='nearest')
 .loc[lambda d: d['x1'].sub(d['x2']).abs().groupby(d['y2']).idxmin()]
)

or, to also have the "diff" column:

(pd
 .merge_asof(df, df2.sort_values(by='x2'),
             left_on='x1', right_on='x2', direction='nearest')
 .assign(diff=lambda d: d['x1'].sub(d['x2']).abs())
 .loc[lambda d: d.groupby('y2')['diff'].idxmin()]
)