Home > Software design >  Merging two dataframes with a bit of slack
Merging two dataframes with a bit of slack

Time:11-11

Imagine I have the following dataframe, that was produced by logging some kind of events, categorized by the "clazz" column.

test_data = [
    [1, "A", "x", "WH", "2021-10-01 04:58:27"],
    [2, "A", "x", "OR", "2021-10-01 04:58:28"],
    [3, "B", "x", "WH", "2021-10-01 04:58:27"],
    [4, "B", "x", "OR", "2021-10-01 04:58:27"],
    [5, "A", "y", "WH", "2021-10-01 04:58:27"],
    [6, "A", "y", "OR", "2021-10-01 04:58:27"]
    
]
test_df = pd.DataFrame(test_data, columns=["id", "cat1", "cat2", "clazz", "date"])

There are two classes, WH and OR and they logically belong together if cat1 and cat2 are equal and they happened at the same time.

Normally I would merge them like follows:

wh = test_data[test_data.clazz == "WH"]
oh = test_data[test_data.clazz == "OH"]
wh.merge(right=oh, how="left", on=["cat1", "cat2", "date"])

Unfortunately, due to some inaccuracies in recording the data, once in a while the dates differ by around a second. Therefore, the lines with id 1 and 2, which would belong together, will not be matched.

Is there an easy and performant way I can somehow correct this while merging?

I found this post here is it possible to do fuzzy match merge with python pandas? that goes in a similar direction. There I tried creating a joined column, consisting of a concatenation of cat1, cat2 and date and try to correct the values like follows:

oh["fuzz"] = (oh.cat1   oh.cat2   oh.str.date)
wh["fuzz"] = (wh.cat1   wh.cat2   wh.str.date)
oh.fuzz = oh.fuzz.map(lambda x: difflib.get_close_matches(x, wh.fuzz, n=2)[0])

This works, but it unfortunately is extremely slow. I broke down my example to a really short period of time, where in wh only remaind around 4000 entries and executing the get_close_matches for a single line took around 500ms which is not applicable for my dataset of around 3 millions.

Thank you

CodePudding user response:

You can use merge_asof:

test_df['date'] = pd.to_datetime(test_df['date'])
test_df = test_df.sort_values('date')


wh = test_df[test_df.clazz == "WH"]
oh = test_df[test_df.clazz == "OR"]

df = pd.merge_asof(wh,
                   oh, 
                   by=["cat1", "cat2"], 
                   on='date', 
                   tolerance = pd.Timedelta(1, unit='s'),
                   direction='nearest')
print (df)
   id_x cat1 cat2 clazz_x                date  id_y clazz_y
0     1    A    x      WH 2021-10-01 04:58:27     2      OR
1     3    B    x      WH 2021-10-01 04:58:27     4      OR
2     5    A    y      WH 2021-10-01 04:58:27     6      OR
  • Related