Merging two dataframes with a hint of common values

Time:09-24

I'd like to merge two dataframes with the following properties:

1. They have no columns in common, so a direct join is not possible.

2. They are of different sizes. For example, df2 has 4 rows and df1 has 3.

3. The link between the two is the date values: "ArrivalDateCap" in df2 and "ArrivalDateTime" in df1.

The dataframes are below:

import pandas as pd

df1 = {'ID1': ['A12', 'A13', 'A14'], 'ArrivalDateTime': ["2021-09-20 16:37", "2021-09-21 03:10", "2021-09-26 03:10"]}
df2 = {'ID': ['001', '002', '003', '004'], 'ArrivalDateCap': ["2021-09-20 18:00", "2021-09-21 18:00", "2021-09-20 18:00", "2021-09-25 16:00"]}

# Build the frames and parse the date strings into datetime64 columns
df1 = pd.DataFrame(df1)
df1["ArrivalDateTime"] = pd.to_datetime(df1["ArrivalDateTime"], format="%Y-%m-%d %H:%M")

df2 = pd.DataFrame(df2)
df2["ArrivalDateCap"] = pd.to_datetime(df2["ArrivalDateCap"], format="%Y-%m-%d %H:%M")

Following point 3 above, each row of df2 should receive the "ArrivalDateTime" from df1 that is closest to, and less than, its "ArrivalDateCap" value. For example, for "ArrivalDateCap" 2021-09-20 18:00:00 the chosen "ArrivalDateTime" is 2021-09-20 16:37:00, because it is the closest value that is still less than the cap. The output, df3, should be as below:

df3 = {'ID': ['001', '002', '003', '004'],
       'ArrivalDateCap': ["2021-09-20 18:00", "2021-09-21 18:00", "2021-09-20 18:00", "2021-09-25 16:00"],
       'ArrivalDateTime': ['2021-09-20 16:37:00', '2021-09-21 03:10:00', '2021-09-20 16:37:00', '2021-09-26 03:10:00'],
       'ID1': ['A12', 'A13', 'A12', 'A14']}
df3 = pd.DataFrame(df3)

I presume it makes sense to compare the 'ArrivalDateCap' and 'ArrivalDateTime' columns and record the rows that meet the condition (less than and closest) in another dataframe. How can I go about implementing this? Thank you in advance.

CodePudding user response:

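Use pd.merge_asof. Its default direction='backward' picks, for each "ArrivalDateCap", the closest "ArrivalDateTime" that is not later than it, which is the "less than and closest" rule from the question. Both frames must be sorted by their key columns first.

With direction='nearest', ID 004 would instead match A14 (2021-09-26 03:10:00), which is later than its cap; that is the row shown in the question's expected df3.

df3 = pd.merge_asof(df2.sort_values('ArrivalDateCap'),
                    df1.sort_values('ArrivalDateTime'),
                    left_on='ArrivalDateCap', right_on='ArrivalDateTime',
                    direction='backward')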

Output:

>>> df3
    ID      ArrivalDateCap  ID1     ArrivalDateTime
0  001 2021-09-20 18:00:00  A12 2021-09-20 16:37:00
1  003 2021-09-20 18:00:00  A12 2021-09-20 16:37:00
2  002 2021-09-21 18:00:00  A13 2021-09-21 03:10:00
3  004 2021-09-25 16:00:00  A13 2021-09-21 03:10:00
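
To make the effect of the direction argument concrete, here is a minimal sketch on the question's data. The helper name match and its strict flag are hypothetical, introduced only for this illustration. It assumes "less than" should be strict, so it passes allow_exact_matches=False for the backward case, and it sorts the result by 'ID' afterwards, because merge_asof returns rows in the sorted key order rather than the original df2 order.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID1": ["A12", "A13", "A14"],
    "ArrivalDateTime": pd.to_datetime(
        ["2021-09-20 16:37", "2021-09-21 03:10", "2021-09-26 03:10"]),
})
df2 = pd.DataFrame({
    "ID": ["001", "002", "003", "004"],
    "ArrivalDateCap": pd.to_datetime(
        ["2021-09-20 18:00", "2021-09-21 18:00",
         "2021-09-20 18:00", "2021-09-25 16:00"]),
})


def match(direction, strict=False):
    """Hypothetical helper: as-of merge df2 onto df1, then restore ID order."""
    merged = pd.merge_asof(
        df2.sort_values("ArrivalDateCap"),    # both sides must be sorted on the key
        df1.sort_values("ArrivalDateTime"),
        left_on="ArrivalDateCap",
        right_on="ArrivalDateTime",
        direction=direction,
        allow_exact_matches=not strict,       # strict=True excludes equal timestamps
    )
    return merged.sort_values("ID").reset_index(drop=True)


backward = match("backward", strict=True)     # closest earlier arrival
nearest = match("nearest")                    # closest arrival in either direction

print(backward)
print(nearest)
```

The two results differ only for ID 004: 'backward' keeps A13, the last arrival before the cap, while 'nearest' picks A14, which is closer in time but later than the cap.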