Here is an example of two dataframes in pandas:
df = pd.DataFrame({0: [1.5, 2, 2.3, 4], 1: [3, 6, 1, 0]})
df2 = pd.DataFrame({0: [1.7, 4.05, 2.1, 2.99], 1: [1, 3, 1, 7]})
df.columns = ["x1", "y1"]
df2.columns = ["x2", "y2"]
Then I merge them and make another dataframe which consists only of x1
and x2
columns that are closest to each other.
merged1 = df.merge(df2, how='cross')
merged1['diff'] = (merged1['x1'].sub(merged1['x2'])).abs()
out1 = (merged1.loc[merged1.groupby(df.columns.tolist())['diff'].idxmin().to_numpy()])
And lastly, I'd like to group by column y2
and print out the row of each group that has the smallest value of the column diff
out1.groupby(["y2"])['diff'].min()
But it prints out all the rows:
x1 y1 x2 y2 diff
0 1.5 3 1.70 1 0.20
6 2.0 6 2.10 1 0.10
10 2.3 1 2.10 1 0.20
13 4.0 0 4.05 3 0.05
What's the problem?
CodePudding user response:
You need to use:
out1.loc[out1.groupby(["y2"])['diff'].idxmin()]
output:
x1 y1 x2 y2 diff
6 2.0 6 2.10 1 0.10
13 4.0 0 4.05 3 0.05
Note that you could do the whole thing with just 2 commands (using pandas.merge_asof
):
(pd
.merge_asof(df, df2.sort_values(by='x2'),
left_on='x1', right_on='x2', direction='nearest')
.loc[lambda d: d['x1'].sub(d['x2']).abs().groupby(d['y2']).idxmin()]
)
or, to also have the "diff" column:
(pd
.merge_asof(df, df2.sort_values(by='x2'),
left_on='x1', right_on='x2', direction='nearest')
.assign(diff=lambda d: d['x1'].sub(d['x2']).abs())
.loc[lambda d: d.groupby('y2')['diff'].idxmin()]
)