Home > Software design >  Pandas - Matching rows in 2 dataframes by non-unique key without reusing rows
Pandas - Matching rows in 2 dataframes by non-unique key without reusing rows

Time:12-29

I have 2 Pandas dataframes which I want to join:

DF_Transactions

H1 H_date
A 2-Jan-20
B 2-Jan-20
C 3-Jan-20
A 2-Jan-20
B 2-Jan-20

DF_Events

H1 H_date H_class
A 2-Jan-20 class_1
B 2-Jan-20 class_2

Using pd.merge(DF_transactions, DF_Events, on=['H1'], how='left') I have obtained too many duplicates

DF_Wrong_Results

H1 H_date H_class
A 2-Jan-20 class_1
B 2-Jan-20 class_1
C 3-Jan-20
A 2-Jan-20 class_2
B 2-Jan-20 class_2
A 2-Jan-20 class_1
B 2-Jan-20 class_1
A 2-Jan-20 class_2
B 2-Jan-20 class_2

I cannot use drop_duplicates on the dataframe as it would remove the last 2 entries in the left dataframe. What is the simplest way to get the correct dataframe below?

DF_Correct_Results

H1 H_date H_class
A 2-Jan-20 class_1
B 2-Jan-20 class_2
C 3-Jan-20
A 2-Jan-20 class_1
B 2-Jan-20 class_2

CodePudding user response:

according to pandas documentation https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html your code should be as such

DF_transactions.merge(DF_Events, on=['H1'], how='left')

It should be (left dataframe).merge((right dataframe),) syntax. Try again and see if it works.

CodePudding user response:

Have a look at merge_asof:

pd.merge_asof(df1.sort_values(['H1', 'H_date']), 
              df2, 
              on='H_date', 
              by='H1')

  H1     H_date  H_class
0  A 2020-01-02  class_1
1  A 2020-01-02  class_1
2  B 2020-01-02  class_2
3  B 2020-01-02  class_2
4  C 2020-01-03      NaN
  • Related