say I have the following databases (suppose they are Dask data frames:
df A =
1
1
2
2
2
2
3
4
5
5
5
5
5
5
df B =
1
2
2
3
3
3
4
5
5
5
and I would like to merge the two so that the resulting DataFrame has the most information among the two (so for instance in the case of observation 1 I would like to preserve the info of df A, in case of observation number 3, I would like to preserve the info of df B and iso on...). In other words the resulting DataFrame should be like this:
df C=
1
1
2
2
2
2
3
3
3
4
5
5
5
5
5
5
Is there a way to do that in Dask?
Thank you
CodePudding user response:
IIUC looks like OP wants to use dask.dataframe.multi.merge_asof
. Start by importing dask.dataframe
and then do the desired merge
import dask.dataframe as dd
df = dd.merge_asof(df_a, df_b, on='sample_id', allow_exact_matches=False)
[Out]:
sample_id
0 1
1 1
2 2
3 2
4 2
5 2
6 3
7 4
8 5
9 5
10 5
11 5
12 5
13 5
Note:
There are various ways to merge dask dataframes. Dask provides various built-in modules, such as:
dask.dataframe.DataFrame.join
,dask.dataframe.multi.concat
,dask.dataframe.DataFrame.merge
,dask.dataframe.multi.merge
,dask.dataframe.multi.merge_asof
. Depending on one's requirements one might want to use a specific one.This thread has really valuable information on merges. Even though its focus is on
Pandas
, it will allow one to understandleft
,right
,outer
, andinner
merges.