Imagine I have the following dataframe, that was produced by logging some kind of events, categorized by the "clazz" column.
test_data = [
[1, "A", "x", "WH", "2021-10-01 04:58:27"],
[2, "A", "x", "OR", "2021-10-01 04:58:28"],
[3, "B", "x", "WH", "2021-10-01 04:58:27"],
[4, "B", "x", "OR", "2021-10-01 04:58:27"],
[5, "A", "y", "WH", "2021-10-01 04:58:27"],
[6, "A", "y", "OR", "2021-10-01 04:58:27"]
]
test_df = pd.DataFrame(test_data, columns=["id", "cat1", "cat2", "clazz", "date"])
There are two classes, WH and OR and they logically belong together if cat1 and cat2 are equal and they happened at the same time.
Normally I would merge them like follows:
wh = test_data[test_data.clazz == "WH"]
oh = test_data[test_data.clazz == "OH"]
wh.merge(right=oh, how="left", on=["cat1", "cat2", "date"])
Unfortunately, due to some inaccuracies in recording the data, once in a while the dates differ by around a second. Therefore, the lines with id 1 and 2, which would belong together, will not be matched.
Is there an easy and performant way I can somehow correct this while merging?
I found this post here is it possible to do fuzzy match merge with python pandas? that goes in a similar direction. There I tried creating a joined column, consisting of a concatenation of cat1, cat2 and date and try to correct the values like follows:
oh["fuzz"] = (oh.cat1 oh.cat2 oh.str.date)
wh["fuzz"] = (wh.cat1 wh.cat2 wh.str.date)
oh.fuzz = oh.fuzz.map(lambda x: difflib.get_close_matches(x, wh.fuzz, n=2)[0])
This works, but it unfortunately is extremely slow. I broke down my example to a really short period of time, where in wh only remaind around 4000 entries and executing the get_close_matches for a single line took around 500ms which is not applicable for my dataset of around 3 millions.
Thank you
CodePudding user response:
You can use merge_asof
:
test_df['date'] = pd.to_datetime(test_df['date'])
test_df = test_df.sort_values('date')
wh = test_df[test_df.clazz == "WH"]
oh = test_df[test_df.clazz == "OR"]
df = pd.merge_asof(wh,
oh,
by=["cat1", "cat2"],
on='date',
tolerance = pd.Timedelta(1, unit='s'),
direction='nearest')
print (df)
id_x cat1 cat2 clazz_x date id_y clazz_y
0 1 A x WH 2021-10-01 04:58:27 2 OR
1 3 B x WH 2021-10-01 04:58:27 4 OR
2 5 A y WH 2021-10-01 04:58:27 6 OR