I'm trying to merge two df's, one df has a datetime
column, and the other has just a date
column. My application for this is to find yesterday's high
price using an OHLC dataset. I've attached some starter code below, but I'll describe what I'm looking for.
Given this intraday
dataset:
time current_intraday_high
0 2022-02-11 09:00:00 1
1 2022-02-11 10:00:00 2
2 2022-02-11 11:00:00 3
3 2022-02-11 12:00:00 4
4 2022-02-11 13:00:00 5
5 2022-02-14 09:00:00 6
6 2022-02-14 10:00:00 7
7 2022-02-14 11:00:00 8
8 2022-02-14 12:00:00 9
9 2022-02-14 13:00:00 10
10 2022-02-15 09:00:00 11
11 2022-02-15 10:00:00 12
12 2022-02-15 11:00:00 13
13 2022-02-15 12:00:00 14
14 2022-02-15 13:00:00 15
15 2022-02-16 09:00:00 16
16 2022-02-16 10:00:00 17
17 2022-02-16 11:00:00 18
18 2022-02-16 12:00:00 19
19 2022-02-16 13:00:00 20
...and this daily
dataframe:
time daily_high
0 2022-02-11 5
1 2022-02-14 10
2 2022-02-15 15
3 2022-02-16 20
...how can I merge them together, and have each row of the intraday
dataframe contain the previous (business) day's high price, like so:
time current_intraday_high yesterdays_high
0 2022-02-11 09:00:00 1 NaN
1 2022-02-11 10:00:00 2 NaN
2 2022-02-11 11:00:00 3 NaN
3 2022-02-11 12:00:00 4 NaN
4 2022-02-11 13:00:00 5 NaN
5 2022-02-14 09:00:00 6 5.0
6 2022-02-14 10:00:00 7 5.0
7 2022-02-14 11:00:00 8 5.0
8 2022-02-14 12:00:00 9 5.0
9 2022-02-14 13:00:00 10 5.0
10 2022-02-15 09:00:00 11 10.0
11 2022-02-15 10:00:00 12 10.0
12 2022-02-15 11:00:00 13 10.0
13 2022-02-15 12:00:00 14 10.0
14 2022-02-15 13:00:00 15 10.0
15 2022-02-16 09:00:00 16 15.0
16 2022-02-16 10:00:00 17 15.0
17 2022-02-16 11:00:00 18 15.0
18 2022-02-16 12:00:00 19 15.0
19 2022-02-16 13:00:00 20 15.0
(Note the NaN
's at the top because we don't have any data for Feb 10, 2022 from the intraday dataset, and see how each row contains the intraday
data, plus the PREVIOUS day's max "high
" price.)
Minimal reproducible example code below:
import pandas as pd
###################################################
# CREATE MOCK INTRADAY DATAFRAME
###################################################
intraday_date_time = [
"2022-02-11 09:00:00",
"2022-02-11 10:00:00",
"2022-02-11 11:00:00",
"2022-02-11 12:00:00",
"2022-02-11 13:00:00",
"2022-02-14 09:00:00",
"2022-02-14 10:00:00",
"2022-02-14 11:00:00",
"2022-02-14 12:00:00",
"2022-02-14 13:00:00",
"2022-02-15 09:00:00",
"2022-02-15 10:00:00",
"2022-02-15 11:00:00",
"2022-02-15 12:00:00",
"2022-02-15 13:00:00",
"2022-02-16 09:00:00",
"2022-02-16 10:00:00",
"2022-02-16 11:00:00",
"2022-02-16 12:00:00",
"2022-02-16 13:00:00",
]
intraday_date_time = pd.to_datetime(intraday_date_time)
intraday_df = pd.DataFrame(
{
"time": intraday_date_time,
"current_intraday_high": [x for x in range(1, 21)],
},
)
print(intraday_df)
# intraday_df.to_csv('intradayTEST.csv', index=True)
###################################################
# AGGREGATE/UPSAMPLE TO DAILY DATAFRAME
###################################################
# Aggregate to business days using intraday_df
agg_dict = {'current_intraday_high': 'max'}
daily_df = intraday_df.set_index('time').resample('B').agg(agg_dict).reset_index()
daily_df.rename(columns={"current_intraday_high": "daily_high"}, inplace=True)
print(daily_df)
# daily_df.to_csv('dailyTEST.csv', index=True)
###################################################
# MERGE THE TWO DATAFRAMES
###################################################
# Need to merge the daily dataset to the intraday dataset, such that,
# any row on the newly merged/joined/concat'd dataset will have:
# 1. The current intraday datetime in the 'time' column
# 2. The current 'intraday_high' value
# 3. The PREVIOUS DAY's 'daily_high' value
# This doesn't work as the daily_df just gets appended to the bottom
# of the intraday_df due to the datetimes/dates merging
merged_df = pd.merge(intraday_df, daily_df, how='outer', on='time')
print(merged_df)
CodePudding user response:
pd.merge_asof
allows you to easily do a merge like this.
yesterdays_high = (intraday_df.resample('B', on='time')['current_intraday_high'].max()
.shift()
.rename('yesterdays_high')
.reset_index())
merged_df = pd.merge_asof(intraday_df, yesterdays_high)
print(merged_df)
Output:
time current_intraday_high yesterdays_high
0 2022-02-11 09:00:00 1 NaN
1 2022-02-11 10:00:00 2 NaN
2 2022-02-11 11:00:00 3 NaN
3 2022-02-11 12:00:00 4 NaN
4 2022-02-11 13:00:00 5 NaN
5 2022-02-14 09:00:00 6 5.0
6 2022-02-14 10:00:00 7 5.0
7 2022-02-14 11:00:00 8 5.0
8 2022-02-14 12:00:00 9 5.0
9 2022-02-14 13:00:00 10 5.0
10 2022-02-15 09:00:00 11 10.0
11 2022-02-15 10:00:00 12 10.0
12 2022-02-15 11:00:00 13 10.0
13 2022-02-15 12:00:00 14 10.0
14 2022-02-15 13:00:00 15 10.0
15 2022-02-16 09:00:00 16 15.0
16 2022-02-16 10:00:00 17 15.0
17 2022-02-16 11:00:00 18 15.0
18 2022-02-16 12:00:00 19 15.0
19 2022-02-16 13:00:00 20 15.0
CodePudding user response:
Given your already existing code, you can map
the shifted values:
intraday_df['yesterdays_high'] = (intraday_df['time']
.dt.date
.map(daily_df['daily_high']
.set_axis(daily_df['time'].shift(-1)))
)
If you don't have all days and really want to map
the real previous business day:
intraday_df['yesterdays_high'] = (intraday_df['time']
.dt.date
.map(daily_df['daily_high']
.set_axis(daily_df['time'].add(pd.offsets.BusinessDay())))
)
Output:
time current_intraday_high yesterdays_high
0 2022-02-11 09:00:00 1 NaN
1 2022-02-11 10:00:00 2 NaN
2 2022-02-11 11:00:00 3 NaN
3 2022-02-11 12:00:00 4 NaN
4 2022-02-11 13:00:00 5 NaN
5 2022-02-14 09:00:00 6 5.0
6 2022-02-14 10:00:00 7 5.0
7 2022-02-14 11:00:00 8 5.0
8 2022-02-14 12:00:00 9 5.0
9 2022-02-14 13:00:00 10 5.0
10 2022-02-15 09:00:00 11 10.0
11 2022-02-15 10:00:00 12 10.0
12 2022-02-15 11:00:00 13 10.0
13 2022-02-15 12:00:00 14 10.0
14 2022-02-15 13:00:00 15 10.0
15 2022-02-16 09:00:00 16 15.0
16 2022-02-16 10:00:00 17 15.0
17 2022-02-16 11:00:00 18 15.0
18 2022-02-16 12:00:00 19 15.0
19 2022-02-16 13:00:00 20 15.0
CodePudding user response:
We can use .dt.date
as an index to join two frames together on the same days. As of previous day hight_price
, we can apply shift
on daily_df
:
intra_date = intraday_df['time'].dt.date
daily_date = daily_df['time'].dt.date
answer = intraday_df.set_index(intra_date).join(
daily_df.set_index(daily_date)['daily_high'].shift()
).reset_index(drop=True)