Home > Software design >  Merging many to many Dask
Merging many to many Dask

Time:09-27

say I have the following databases (suppose they are Dask data frames:

df A =

1
1
2
2
2
2
3
4
5
5
5
5
5
5

df B =

1
2
2
3
3
3
4
5
5
5

and I would like to merge the two so that the resulting DataFrame has the most information among the two (so for instance in the case of observation 1 I would like to preserve the info of df A, in case of observation number 3, I would like to preserve the info of df B and iso on...). In other words the resulting DataFrame should be like this:

df C=

1
1
2
2
2
2
3
3
3
4
5
5
5
5
5
5

Is there a way to do that in Dask?

Thank you

CodePudding user response:

IIUC looks like OP wants to use dask.dataframe.multi.merge_asof. Start by importing dask.dataframe and then do the desired merge

import dask.dataframe as dd

df = dd.merge_asof(df_a, df_b, on='sample_id', allow_exact_matches=False)

[Out]:
sample_id
0           1
1           1
2           2
3           2
4           2
5           2
6           3
7           4
8           5
9           5
10          5
11          5
12          5
13          5

Note:

  • Related