I'm trying to perform a new column calculation WITHOUT using a for loop (using OHLC stock data) that denotes the rolling highest high, based on a True value in another column. I've defined a "first hour of trading" time window between 9:30 and 10:30. During that time, as we iterate through the rows, if "first hour of trading" == True and the current high is >= the previous high, then use that newer high. If we're NOT inside the "first hour of trading", just bring forward the previous value.
So, using this reproducible dataset:
import numpy as np
import pandas as pd
from datetime import datetime
# Create a reproducible, static dataframe.
# 1 minute SPY data. Skip to the bottom...
df = pd.DataFrame([
{
"time": "2021-10-26 9:30",
"open": "457.2",
"high": "457.29",
"low": "456.78",
"close": "456.9383",
"volume": "594142"
},
{
"time": "2021-10-26 9:31",
"open": "456.94",
"high": "457.07",
"low": "456.8",
"close": "456.995",
"volume": "194061"
},
{
"time": "2021-10-26 9:32",
"open": "456.99",
"high": "457.22",
"low": "456.84",
"close": "457.21",
"volume": "186114"
},
{
"time": "2021-10-26 9:33",
"open": "457.22",
"high": "457.45",
"low": "457.2011",
"close": "457.308",
"volume": "294158"
},
{
"time": "2021-10-26 9:34",
"open": "457.31",
"high": "457.4",
"low": "457.25",
"close": "457.32",
"volume": "172574"
},
{
"time": "2021-10-26 9:35",
"open": "457.31",
"high": "457.48",
"low": "457.18",
"close": "457.44",
"volume": "396668"
},
{
"time": "2021-10-26 9:36",
"open": "457.48",
"high": "457.6511",
"low": "457.44",
"close": "457.57",
"volume": "186777"
},
{
"time": "2021-10-26 9:37",
"open": "457.5699",
"high": "457.73",
"low": "457.5699",
"close": "457.69",
"volume": "187596"
},
{
"time": "2021-10-26 9:38",
"open": "457.7",
"high": "457.73",
"low": "457.54",
"close": "457.63",
"volume": "185570"
},
{
"time": "2021-10-26 9:39",
"open": "457.63",
"high": "457.64",
"low": "457.31",
"close": "457.59",
"volume": "164707"
},
{
"time": "2021-10-26 9:40",
"open": "457.59",
"high": "457.72",
"low": "457.46",
"close": "457.7199",
"volume": "167438"
},
{
"time": "2021-10-26 9:41",
"open": "457.72",
"high": "457.8",
"low": "457.68",
"close": "457.72",
"volume": "199951"
},
{
"time": "2021-10-26 9:42",
"open": "457.73",
"high": "457.74",
"low": "457.6",
"close": "457.62",
"volume": "152134"
},
{
"time": "2021-10-26 9:43",
"open": "457.6",
"high": "457.65",
"low": "457.45",
"close": "457.5077",
"volume": "142530"
},
{
"time": "2021-10-26 9:44",
"open": "457.51",
"high": "457.64",
"low": "457.4001",
"close": "457.61",
"volume": "122575"
},
{
"time": "2021-10-26 9:45",
"open": "457.61",
"high": "457.76",
"low": "457.58",
"close": "457.75",
"volume": "119886"
},
{
"time": "2021-10-26 9:46",
"open": "457.74",
"high": "457.75",
"low": "457.37",
"close": "457.38",
"volume": "183157"
},
{
"time": "2021-10-26 9:47",
"open": "457.42",
"high": "457.49",
"low": "457.37",
"close": "457.44",
"volume": "128542"
},
{
"time": "2021-10-26 9:48",
"open": "457.43",
"high": "457.49",
"low": "457.33",
"close": "457.44",
"volume": "154181"
},
{
"time": "2021-10-26 9:49",
"open": "457.43",
"high": "457.5898",
"low": "457.42",
"close": "457.47",
"volume": "163063"
},
{
"time": "2021-10-26 9:50",
"open": "457.45",
"high": "457.59",
"low": "457.44",
"close": "457.555",
"volume": "96229"
},
{
"time": "2021-10-26 9:51",
"open": "457.56",
"high": "457.61",
"low": "457.31",
"close": "457.4217",
"volume": "110380"
},
{
"time": "2021-10-26 9:52",
"open": "457.42",
"high": "457.56",
"low": "457.42",
"close": "457.47",
"volume": "107518"
},
{
"time": "2021-10-26 9:53",
"open": "457.475",
"high": "457.51",
"low": "457.4",
"close": "457.48",
"volume": "78062"
},
{
"time": "2021-10-26 9:54",
"open": "457.49",
"high": "457.57",
"low": "457.42",
"close": "457.46",
"volume": "133883"
},
{
"time": "2021-10-26 9:55",
"open": "457.47",
"high": "457.56",
"low": "457.45",
"close": "457.51",
"volume": "98998"
},
{
"time": "2021-10-26 9:56",
"open": "457.51",
"high": "457.54",
"low": "457.43",
"close": "457.43",
"volume": "110237"
},
{
"time": "2021-10-26 9:57",
"open": "457.43",
"high": "457.65",
"low": "457.375",
"close": "457.65",
"volume": "98794"
},
{
"time": "2021-10-26 9:58",
"open": "457.66",
"high": "457.69",
"low": "457.35",
"close": "457.45",
"volume": "262154"
},
{
"time": "2021-10-26 9:59",
"open": "457.45",
"high": "457.47",
"low": "457.33",
"close": "457.4",
"volume": "74685"
},
{
"time": "2021-10-26 10:00",
"open": "457.41",
"high": "457.48",
"low": "457.18",
"close": "457.38",
"volume": "166617"
},
{
"time": "2021-10-26 10:01",
"open": "457.39",
"high": "457.7",
"low": "457.39",
"close": "457.5",
"volume": "265649"
},
{
"time": "2021-10-26 10:02",
"open": "457.51",
"high": "457.57",
"low": "457.39",
"close": "457.53",
"volume": "131947"
},
{
"time": "2021-10-26 10:03",
"open": "457.53",
"high": "457.54",
"low": "457.4",
"close": "457.51",
"volume": "80111"
},
{
"time": "2021-10-26 10:04",
"open": "457.51",
"high": "457.62",
"low": "457.5",
"close": "457.6101",
"volume": "117174"
},
{
"time": "2021-10-26 10:05",
"open": "457.621",
"high": "457.64",
"low": "457.51",
"close": "457.58",
"volume": "168758"
},
{
"time": "2021-10-26 10:06",
"open": "457.58",
"high": "457.64",
"low": "457.46",
"close": "457.61",
"volume": "84076"
},
{
"time": "2021-10-26 10:07",
"open": "457.62",
"high": "457.7401",
"low": "457.62",
"close": "457.66",
"volume": "125156"
},
{
"time": "2021-10-26 10:08",
"open": "457.665",
"high": "457.69",
"low": "457.5",
"close": "457.67",
"volume": "116919"
},
{
"time": "2021-10-26 10:09",
"open": "457.69",
"high": "457.72",
"low": "457.5",
"close": "457.57",
"volume": "102551"
},
{
"time": "2021-10-26 10:10",
"open": "457.56",
"high": "457.75",
"low": "457.56",
"close": "457.7",
"volume": "109165"
},
{
"time": "2021-10-26 10:11",
"open": "457.7",
"high": "457.725",
"low": "457.63",
"close": "457.66",
"volume": "146209"
},
{
"time": "2021-10-26 10:12",
"open": "457.665",
"high": "457.88",
"low": "457.64",
"close": "457.86",
"volume": "210620"
},
{
"time": "2021-10-26 10:13",
"open": "457.855",
"high": "457.96",
"low": "457.83",
"close": "457.95",
"volume": "159975"
},
{
"time": "2021-10-26 10:14",
"open": "457.95",
"high": "458.02",
"low": "457.93",
"close": "457.95",
"volume": "152042"
},
{
"time": "2021-10-26 10:15",
"open": "457.96",
"high": "458.15",
"low": "457.96",
"close": "458.08",
"volume": "146047"
},
{
"time": "2021-10-26 10:16",
"open": "458.085",
"high": "458.17",
"low": "457.99",
"close": "458.15",
"volume": "100732"
},
{
"time": "2021-10-26 10:17",
"open": "458.17",
"high": "458.33",
"low": "458.155",
"close": "458.245",
"volume": "235072"
},
{
"time": "2021-10-26 10:18",
"open": "458.25",
"high": "458.29",
"low": "458.14",
"close": "458.16",
"volume": "422002"
},
{
"time": "2021-10-26 10:19",
"open": "458.17",
"high": "458.2801",
"low": "458.1699",
"close": "458.28",
"volume": "114611"
},
{
"time": "2021-10-26 10:20",
"open": "458.29",
"high": "458.39",
"low": "458.24",
"close": "458.37",
"volume": "241797"
},
{
"time": "2021-10-26 10:21",
"open": "458.37",
"high": "458.42",
"low": "458.31",
"close": "458.345",
"volume": "124824"
},
{
"time": "2021-10-26 10:22",
"open": "458.33",
"high": "458.49",
"low": "458.33",
"close": "458.47",
"volume": "132125"
}
])
... and this code which can go below the above df:
# Convert df to numeric and time to datetime re: the .csv to .json
# converter tool I used online...
df[['open','high','low','close','volume']] = df[['open','high','low','close','volume']].apply(pd.to_numeric)
df['time'] = pd.to_datetime(df['time'])
# When will the first hour of trading be?
startOfFirstHourTrading = datetime.strptime("0930", "%H%M").time()
endOfFirstHourTrading = datetime.strptime("1030", "%H%M").time()
# ...then using those times, define a boolean column denoting whether
# we're currently in the first hour of trading or not
df['isFirstHourTrading'] = np.where((df['time'].dt.time >= startOfFirstHourTrading) & (df['time'].dt.time < endOfFirstHourTrading), True, False)
df['FirstHourHigh'] = 0.0  # float so the highs below can be assigned without dtype warnings
# WORKING EXAMPLE
# Iterate through the df
for i in range(len(df)):
    # If we're not currently within the first hour of trading, just
    # bring forward the last value
    if df['isFirstHourTrading'].iloc[i] == False:
        df.loc[df.index[i], 'FirstHourHigh'] = df['FirstHourHigh'].iloc[i-1]
        continue
    # ... otherwise, if we ARE in the first hour of trading, keep track
    # of the rolling highest high during the first hour
    # (guard i == 0 so iloc[i-1] doesn't wrap around to the last row)
    prev_high = df['high'].iloc[i-1] if i > 0 else df['high'].iloc[i]
    df.loc[df.index[i], 'FirstHourHigh'] = max(df['high'].iloc[i], prev_high)
# Export the correct answer dataset to compare the next function
# to below
df.to_csv("correct_answers.csv", index=False)
# NON-WORKING EXAMPLE
# What I'd like to do is NOT use a for loop to do the above. I envision
# we can use np.where() and maybe a groupby() here? But I don't know how yet.
# Pseudocode might look something like:
df['FirstHourHigh'] = df['high'].rolling(window=DYNAMIC_START_OF_FIRST_HOUR_TRADING_ROW?).max()
# ... but this obviously doesn't work, and it doesn't take into account whether
# df['isFirstHourTrading'] == True, which it should
# Cross-check for matches with correct_answers.csv
df.to_csv("correct_answers2.csv", index=False)
...you'll see that I've implemented what I want using a for loop, so there's a correct result to compare against afterward. I'm looking to achieve the same thing as the for loop, but without the loop, just a one-line column calculation. Any ideas? Thanks!
CodePudding user response:
First parse time as a datetime, then mask the values in the first trading hour, take their cumulative max, and carry the highest values forward:
>>> df['time'] = pd.to_datetime(df['time'])
>>> first_hour = (df['time'].dt.hour * 60 + df['time'].dt.minute).between(9.5 * 60, 10.5 * 60)
>>> df['high'].where(first_hour).cummax().ffill()
Here's what it does on a smaller example that actually spans several days and has times outside of the first hour too:
>>> df
time high
0 2021-10-26 09:30:00 525
1 2021-10-26 10:00:00 504
2 2021-10-26 10:30:00 550
3 2021-10-26 11:00:00 567
4 2021-10-27 09:30:00 520
5 2021-10-27 10:00:00 576
6 2021-10-27 10:30:00 508
7 2021-10-27 11:00:00 532
>>> df['high'].where((df['time'].dt.hour * 60 + df['time'].dt.minute).between(9.5 * 60, 10.5 * 60)).cummax().ffill()
0 525.0
1 525.0
2 550.0
3 550.0
4 550.0
5 576.0
6 576.0
7 576.0
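Applied back to the question's DataFrame, a minimal sketch (first_hour is the mask defined above; FirstHourHigh is the asker's column name). If the data spans several dates and the running high should reset each morning instead of carrying the previous day's value forward, a per-date groupby is one option; that reset is an assumption on my part, not something the answer above requires:
df['FirstHourHigh'] = df['high'].where(first_hour).cummax().ffill()

# Per-day variant (assumption: the running high resets each trading day)
day = df['time'].dt.date
masked = df['high'].where(first_hour)
df['FirstHourHigh'] = masked.groupby(day).cummax().groupby(day).ffill()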
CodePudding user response:
First things first, it's probably expected that you have a lot of data that doesn't fit the "first hour of trading" == True criterion, right? If that's the case, we can start by filtering out that data so that we don't need to deal with the other case.
firstHourTrading = df[df["isFirstHourTrading"] == True]
If that's not an option, that's okay too. We can speed the loop up by using apply, passing in the previous row, and keeping the same algorithm you used in your for loop.
df.apply(lambda row: your_business_logic(row, row.shift()), axis=1)
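As a hedged sketch of what that row-wise apply could look like: the previous row's high is precomputed into its own column first, because calling .shift() on the row Series inside apply shifts values across columns rather than pulling in the previous row. The per-row rule below simply mirrors the asker's loop and is only an illustration:
# Precompute the previous row's high so each row can see it during apply
df['prev_high'] = df['high'].shift()

def first_hour_high(row):
    if not row['isFirstHourTrading']:
        # Outside the first hour: leave a gap to fill forward afterwards
        return np.nan
    if pd.isna(row['prev_high']):
        # The very first row has no previous high to compare against
        return row['high']
    # Inside the first hour: larger of the current and previous high
    return max(row['high'], row['prev_high'])

df['FirstHourHigh'] = df.apply(first_hour_high, axis=1).ffill()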
Alternatively, you can try the cummax function from pandas if that suits your needs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cummax.html
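A minimal sketch of what that could look like, building on the firstHourTrading frame filtered above: take the running high over the first-hour rows only, then reindex it onto the full frame and fill forward (FirstHourHigh is the asker's column name):
df['FirstHourHigh'] = (
    firstHourTrading['high']
    .cummax()             # running highest high, first-hour rows only
    .reindex(df.index)    # back onto every row of the original frame
    .ffill()              # carry the last first-hour high forward
)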
CodePudding user response:
The pandas cummax function can be applied to the high column multiplied by isFirstHourTrading, as follows:
df["FirstHourHigh"] = (df["isFirstHourTrading"]*df["high"]).cummax()