Joining values indexed by date and time with irregular measurements-CodePudding

I have two dataframes, both with a timestamp and some measurement; let's say measurement X and Y. I'd like to join these on their date/time index to compare the measurements but their timestampss are not exactly the same. Furthermore, the X measurement is taken more often than Y, so when joining, the Y value should only be appended to the row where the timestamp of X is closest and the other rows of X may be discarded.

These X and Y dataframes would look something like this:

x = pd.DataFrame([[datetime(year=2022, month=9, day=6, hour=13, minute=13, second=13), 4], 
                  [datetime(year=2022, month=9, day=6, hour=13, minute=15, second=24), 3],
                  [datetime(year=2022, month=9, day=6, hour=13, minute=18, second=36), 2],
                  [datetime(year=2022, month=9, day=6, hour=13, minute=21, second=56), 1]], columns=['time', 'x'])

y = pd.DataFrame([[datetime(year=2022, month=9, day=6, hour=13, minute=13, second=59), 14], 
                  [datetime(year=2022, month=9, day=6, hour=13, minute=19, second=10), 13]], columns=['time', 'y'])

And the joined outcome based on these examples is supposed to look something like this:

joined = pd.DataFrame([[datetime(year=2022, month=9, day=6, hour=13, minute=13, second=13), 
                        datetime(year=2022, month=9, day=6, hour=13, minute=13, second=59), 
                        4, 14], 
                       [datetime(year=2022, month=9, day=6, hour=13, minute=18, second=36), 
                        datetime(year=2022, month=9, day=6, hour=13, minute=19, second=10), 
                        2, 13]], 
                      columns=['time_x', 'time_y', 'x', 'y'])

Can anyone help me out with this problem?

CodePudding user response：

Use pandas.merge_asof after renaming the columns:

out = pd.merge_asof(y.rename(columns={'time': 'time_y'}),
                    x.rename(columns={'time': 'time_x'}),
                    left_on='time_y', right_on='time_x',
                    direction='nearest' # or leave the default backward depending on your use case
                   )

output:

               time_y   y              time_x  x
0 2022-09-06 13:13:59  14 2022-09-06 13:13:13  4
1 2022-09-06 13:19:10  13 2022-09-06 13:18:36  2

CodePudding user response：

Use merge_asof with direction='nearest', then remove repeated rows:

df = pd.merge_asof(x.rename(columns={'time': 'time_x'}), 
                   y.rename(columns={'time':'time_y'}), 
                   left_on='time_x', right_on='time_y', direction='nearest')


df = df.drop_duplicates(['time_y','y'])[['time_x','time_y','x','y']]
print (df)
               time_x              time_y  x   y
0 2022-09-06 13:13:13 2022-09-06 13:13:59  4  14
2 2022-09-06 13:18:36 2022-09-06 13:19:10  2  13