I have 2 Pandas dataframes which I want to join:
DF_Transactions
H1 | H_date |
---|---|
A | 2-Jan-20 |
B | 2-Jan-20 |
C | 3-Jan-20 |
A | 2-Jan-20 |
B | 2-Jan-20 |
DF_Events
H1 | H_date | H_class |
---|---|---|
A | 2-Jan-20 | class_1 |
B | 2-Jan-20 | class_2 |
Using pd.merge(DF_transactions, DF_Events, on=['H1'], how='left') I have obtained too many duplicates
DF_Wrong_Results
H1 | H_date | H_class |
---|---|---|
A | 2-Jan-20 | class_1 |
B | 2-Jan-20 | class_1 |
C | 3-Jan-20 | |
A | 2-Jan-20 | class_2 |
B | 2-Jan-20 | class_2 |
A | 2-Jan-20 | class_1 |
B | 2-Jan-20 | class_1 |
A | 2-Jan-20 | class_2 |
B | 2-Jan-20 | class_2 |
I cannot use drop_duplicates on the dataframe as it would remove the last 2 entries in the left dataframe. What is the simplest way to get the correct dataframe below?
DF_Correct_Results
H1 | H_date | H_class |
---|---|---|
A | 2-Jan-20 | class_1 |
B | 2-Jan-20 | class_2 |
C | 3-Jan-20 | |
A | 2-Jan-20 | class_1 |
B | 2-Jan-20 | class_2 |
CodePudding user response:
according to pandas documentation https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html your code should be as such
DF_transactions.merge(DF_Events, on=['H1'], how='left')
It should be (left dataframe).merge((right dataframe),) syntax. Try again and see if it works.
CodePudding user response:
Have a look at merge_asof:
pd.merge_asof(df1.sort_values(['H1', 'H_date']),
df2,
on='H_date',
by='H1')
H1 H_date H_class
0 A 2020-01-02 class_1
1 A 2020-01-02 class_1
2 B 2020-01-02 class_2
3 B 2020-01-02 class_2
4 C 2020-01-03 NaN