I have two dataframes of the form:
df1:
note_id start_sentence end_sentence
0 476766 328 452
1 476766 941 1065
2 500941 377 522
3 500941 797 963
4 500941 1722 1917
and
df2:
note_id start end
0 476766 300 327
1 500941 700 796
I want to merge these such that the join is on note_id
and the difference between start
and start_sentence
is minimized and start_sentence - start
is positive; ideally, something like
df1.merge(df2, on=['note_id']).query('min(start_sentence - start)&(start_sentence > start)')
but, this of course does not work.
The end goal then would be a full outer join like:
note_id start_sentence end_sentence start end
0 476766 328 452 300 327
1 476766 941 1065 NA NA
2 500941 377 522 NA NA
3 500941 797 963 700 796
4 500941 1722 1917 NA NA
I know how to do this using an iterable, but since I have 100s of thousands of rows, this is slow.
CodePudding user response:
I would just do an inner join, and then use all of your conditions to update the values.
Specifically calculate the diff where the diff is >0, and then fill everything but the min diff per group with np.nan
import pandas as pd
df1 = pd.DataFrame({'note_id': [476766, 476766, 500941, 500941, 500941],
'start_sentence': [328, 941, 377, 797, 1722],
'end_sentence': [452, 1065, 522, 963, 1917]})
df2 = pd.DataFrame({'note_id': [476766, 500941], 'start': [300, 700], 'end': [327, 796]})
df = pd.merge(df1,df2, on='note_id')
df.loc[(df['start_sentence']-df['start']).gt(0),'diff'] = df['start_sentence']-df['start']
df.loc[~df.index.isin(df.groupby('note_id')['diff'].idxmin()), ['start','end']] = np.nan
df.drop(columns='diff', inplace=True)
print(df)
Output
note_id start_sentence end_sentence start end
0 476766 328 452 300.0 327.0
1 476766 941 1065 NaN NaN
2 500941 377 522 NaN NaN
3 500941 797 963 700.0 796.0
4 500941 1722 1917 NaN NaN