I have two dataframes, both with a timestamp and some measurement; let's say measurement X and Y. I'd like to join these on their date/time index to compare the measurements but their timestampss are not exactly the same. Furthermore, the X measurement is taken more often than Y, so when joining, the Y value should only be appended to the row where the timestamp of X is closest and the other rows of X may be discarded.
These X and Y dataframes would look something like this:
x = pd.DataFrame([[datetime(year=2022, month=9, day=6, hour=13, minute=13, second=13), 4],
[datetime(year=2022, month=9, day=6, hour=13, minute=15, second=24), 3],
[datetime(year=2022, month=9, day=6, hour=13, minute=18, second=36), 2],
[datetime(year=2022, month=9, day=6, hour=13, minute=21, second=56), 1]], columns=['time', 'x'])
y = pd.DataFrame([[datetime(year=2022, month=9, day=6, hour=13, minute=13, second=59), 14],
[datetime(year=2022, month=9, day=6, hour=13, minute=19, second=10), 13]], columns=['time', 'y'])
And the joined outcome based on these examples is supposed to look something like this:
joined = pd.DataFrame([[datetime(year=2022, month=9, day=6, hour=13, minute=13, second=13),
datetime(year=2022, month=9, day=6, hour=13, minute=13, second=59),
4, 14],
[datetime(year=2022, month=9, day=6, hour=13, minute=18, second=36),
datetime(year=2022, month=9, day=6, hour=13, minute=19, second=10),
2, 13]],
columns=['time_x', 'time_y', 'x', 'y'])
Can anyone help me out with this problem?
CodePudding user response:
Use pandas.merge_asof
after renaming the columns:
out = pd.merge_asof(y.rename(columns={'time': 'time_y'}),
x.rename(columns={'time': 'time_x'}),
left_on='time_y', right_on='time_x',
direction='nearest' # or leave the default backward depending on your use case
)
output:
time_y y time_x x
0 2022-09-06 13:13:59 14 2022-09-06 13:13:13 4
1 2022-09-06 13:19:10 13 2022-09-06 13:18:36 2
CodePudding user response:
Use merge_asof
with direction='nearest'
, then remove repeated rows:
df = pd.merge_asof(x.rename(columns={'time': 'time_x'}),
y.rename(columns={'time':'time_y'}),
left_on='time_x', right_on='time_y', direction='nearest')
df = df.drop_duplicates(['time_y','y'])[['time_x','time_y','x','y']]
print (df)
time_x time_y x y
0 2022-09-06 13:13:13 2022-09-06 13:13:59 4 14
2 2022-09-06 13:18:36 2022-09-06 13:19:10 2 13