Pandas - Matching rows in 2 dataframes by non-unique key without reusing rows-CodePudding

I have 2 Pandas dataframes which I want to join:

DF_Transactions

H1	H_date
A	2-Jan-20
B	2-Jan-20
C	3-Jan-20
A	2-Jan-20
B	2-Jan-20

DF_Events

H1	H_date	H_class
A	2-Jan-20	class_1
B	2-Jan-20	class_2

Using pd.merge(DF_transactions, DF_Events, on=['H1'], how='left') I have obtained too many duplicates

DF_Wrong_Results

H1	H_date	H_class
A	2-Jan-20	class_1
B	2-Jan-20	class_1
C	3-Jan-20
A	2-Jan-20	class_2
B	2-Jan-20	class_2
A	2-Jan-20	class_1
B	2-Jan-20	class_1
A	2-Jan-20	class_2
B	2-Jan-20	class_2

I cannot use drop_duplicates on the dataframe as it would remove the last 2 entries in the left dataframe. What is the simplest way to get the correct dataframe below?

DF_Correct_Results

H1	H_date	H_class
A	2-Jan-20	class_1
B	2-Jan-20	class_2
C	3-Jan-20
A	2-Jan-20	class_1
B	2-Jan-20	class_2

CodePudding user response：

according to pandas documentation https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html your code should be as such

DF_transactions.merge(DF_Events, on=['H1'], how='left')

It should be (left dataframe).merge((right dataframe),) syntax. Try again and see if it works.

CodePudding user response：

Have a look at merge_asof:

pd.merge_asof(df1.sort_values(['H1', 'H_date']), 
              df2, 
              on='H_date', 
              by='H1')

  H1     H_date  H_class
0  A 2020-01-02  class_1
1  A 2020-01-02  class_1
2  B 2020-01-02  class_2
3  B 2020-01-02  class_2
4  C 2020-01-03      NaN