Home > Blockchain >  How to do merge of these dataframes
How to do merge of these dataframes

Time:04-15

I have two dataframes of the form:

df1: 
     note_id  start_sentence  end_sentence  
0    476766             328           452   
1    476766             941          1065   
2    500941             377           522   
3    500941             797           963   
4    500941            1722          1917

and

df2:
    note_id  start  end  
0    476766  300    327   
1    500941  700    796

I want to merge these such that the join is on note_id and the difference between start and start_sentence is minimized and start_sentence - start is positive; ideally, something like

df1.merge(df2, on=['note_id']).query('min(start_sentence - start)&(start_sentence > start)')

but, this of course does not work.

The end goal then would be a full outer join like:

  note_id  start_sentence  end_sentence    start      end
0    476766             328           452    300      327
1    476766             941          1065     NA       NA
2    500941             377           522     NA       NA
3    500941             797           963    700      796
4    500941            1722          1917     NA       NA

I know how to do this using an iterable, but since I have 100s of thousands of rows, this is slow.

CodePudding user response:

I would just do an inner join, and then use all of your conditions to update the values.

Specifically calculate the diff where the diff is >0, and then fill everything but the min diff per group with np.nan

import pandas as pd

df1 = pd.DataFrame({'note_id': [476766, 476766, 500941, 500941, 500941],
 'start_sentence': [328, 941, 377, 797, 1722],
 'end_sentence': [452, 1065, 522, 963, 1917]})

df2 = pd.DataFrame({'note_id': [476766, 500941], 'start': [300, 700], 'end': [327, 796]})

df = pd.merge(df1,df2, on='note_id')

df.loc[(df['start_sentence']-df['start']).gt(0),'diff'] = df['start_sentence']-df['start']
df.loc[~df.index.isin(df.groupby('note_id')['diff'].idxmin()), ['start','end']] = np.nan
df.drop(columns='diff', inplace=True)

print(df)

Output

   note_id  start_sentence  end_sentence  start    end
0   476766             328           452  300.0  327.0
1   476766             941          1065    NaN    NaN
2   500941             377           522    NaN    NaN
3   500941             797           963  700.0  796.0
4   500941            1722          1917    NaN    NaN
  • Related