I have a stock dataset which for the sake of simplicity of this question only has a 'close' column and looks like this:
import pandas as pd
df = pd.DataFrame({'close': [1, 1.01, 1.015, 1.0, 0.98]})
print(df)
   close
0  1.000
1  1.010
2  1.015
3  1.000
4  0.980
Now I want to classify each datapoint according to whether it would be a profitable long opportunity. A datapoint is a profitable long opportunity if a future price reaches a level greater than or equal to the specified win level. Conversely, it is not profitable if a loss level is hit first.
I'm currently using this function:
def classify_long_opportunities(
    df: pd.DataFrame,
    profit_pct: float = 0.01,  # won after one percent
    loss_pct: float = 0.01,    # lost after one percent
):
    res = []
    for t in df[['close']].itertuples():
        idx, c = t                    # extract index and close price
        win = c + c * profit_pct      # calculate the desired win level
        loss = c - c * loss_pct       # "" loss level
        for e in df[['close']].loc[idx:].itertuples():  # iterate through the future prices
            _, p_c = e                # extract posterior closing price
            if c < win <= p_c:        # found a fitting profit level
                res.append(1)
                break
            elif c > loss >= p_c:     # found a losing level
                res.append(-1)
                break
        else:                         # didn't find a fitting level at all
            res.append(0)
    df['long_opportunities'] = res
The function classifies correctly:
classify_long_opportunities(df=df)
print(df)
   close  long_opportunities
0  1.000                   1
1  1.010                  -1
2  1.015                  -1
3  1.000                  -1
4  0.980                   0
But it's very slow. How can I optimize this function using vectorization, e.g. numpy.where, numpy.select, or a pandas function?
CodePudding user response:
IIUC, you can use a cumulated min/max and numpy.select:

import numpy as np

profit_pct = 0.01
loss_pct = 0.01

m1 = df['close'].mul(1 + profit_pct).ge(df.loc[::-1, 'close'].cummax())
m2 = df['close'].mul(1 - loss_pct).le(df.loc[::-1, 'close'].cummin())

df['long_opportunities'] = np.select([m2, m1], [0, -1], 1)
print(df)
print(df)
output:

   close  long_opportunities
0  1.000                   1
1  1.010                  -1
2  1.015                  -1
3  1.000                  -1
4  0.980                   0
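The key building block here is the reversed cumulative max/min: computing cummax on the reversed Series gives, for each row, the maximum of all values from that row to the end, and pandas aligns the result back by index. A minimal sketch of just that piece, on the example data:

```python
import pandas as pd

s = pd.Series([1, 1.01, 1.015, 1.0, 0.98])

# cummax over the reversed series = running max from each row to the end;
# index alignment puts the values back in the original row order
fwd_max = s[::-1].cummax()[::-1]
print(fwd_max.tolist())  # [1.015, 1.015, 1.015, 1.0, 0.98]
```

Each row of `fwd_max` is the best future (or current) price, which is what the `ge`/`le` comparisons above test against.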
CodePudding user response:
Optimisation of your code
Keeping the exact same idea, including the outer iteration, your code can be vectorized a bit, losing at least the inner for loop:
def classOpt(df, profit_pct=1.01, loss_pct=0.99):
    vals = df.close.values
    res = []
    for i in range(len(df)):
        win = vals[i] * profit_pct
        loss = vals[i] * loss_pct
        futw = np.argmax(vals[i:] >= win)
        futl = np.argmax(vals[i:] <= loss)
        if (futw > 0) and (futl == 0 or futl > futw):
            res.append(1)
        elif (futl > 0) and (futw == 0 or futw > futl):
            res.append(-1)
        else:
            res.append(0)
    df['opt'] = res
The idea is, at each stage, to work in a "vectorized" way at least on the array of future values; so at stage i, on vals[i:]. We get a bool array saying which future values are wins (vals[i:] >= win), and which are losses (vals[i:] <= loss). With np.argmax we can easily get when this win or loss will occur, if it occurs: np.argmax(vals[i:] >= win).

Note that since we included column i in the future values (as a sentinel, in fact), we know that the first boolean has to be False. So if np.argmax(vals[i:] >= win) is 0, that means there is no future win to come. If it is non-zero, it is the number of days until the first future win occurs.
Likewise for the future loss.

So, the result is 1 if futw is non-zero and futl is either 0 or bigger than futw. That is, if there is a win to come, and either no loss, or a loss further in the future than the win to come (again, I find it a strange rule, but it is the one from your code). The symmetric situation is a -1; else 0.
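The sentinel behaviour of np.argmax on a boolean array can be seen on the example data (a small sketch, using the 1 % win level from above):

```python
import numpy as np

vals = np.array([1.0, 1.01, 1.015, 1.0, 0.98])

# Row 0: vals[0] itself can never reach its own win level, so the first
# boolean is always False; argmax therefore returns the first future win...
futw = np.argmax(vals[0:] >= vals[0] * 1.01)
print(futw)  # 1 -> first future win is 1 step ahead

# ...and an argmax of 0 over an all-False array means "no future win at all"
futw_none = np.argmax(vals[2:] >= vals[2] * 1.01)
print(futw_none)  # 0 -> no future win from row 2 onwards
```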
Sliding window method
(Note: it is the fourth time in only a week or two that I have used this function in SO answers. A bit of recycling :-). In selected answers, by the way, so, so far, it was really efficient. I fear this time, if mozway manages to correct the result differences, that it will not go as well.)
This method is based on the np.lib.stride_tricks.sliding_window_view function.

If M is [1, 2, 3, 10, 20, 30, 40], then sliding_window_view(M, (3,)) is

[[ 1,  2,  3],
 [ 2,  3, 10],
 [ 3, 10, 20],
 [10, 20, 30],
 [20, 30, 40]]
I think you see how it can be useful for computing with future values.
And one beauty of it is that it is just a view: no memory is actually allocated for this (otherwise potentially huge) array.
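That no-copy property can be checked directly (a quick sketch):

```python
import numpy as np

M = np.arange(7)
view = np.lib.stride_tricks.sliding_window_view(M, (3,))

print(view.shape)                 # (5, 3): one row per window
print(np.shares_memory(view, M))  # True: it is a view, not a copy
```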
In your case, because we want all future values, we need len(df) columns. And since we want that even for the last line, we first need to pad the values with some NaN; len(df)-1 NaN precisely, so that the last line can have exactly as many (void) predictions as the first line.

Then we have a len(df)×len(df) view, with the first column being the actual values, and each other column being future values at D+1, D+2, ...

From there, we just have to do the exact same thing as before, with argmax(... >= win).
Here is the code
def slide(df, profit_pct=1.01, loss_pct=0.99):
    n = len(df)
    valswithnan = np.concatenate([df.close.values, [np.nan]*(n-1)])
    view = np.lib.stride_tricks.sliding_window_view(valswithnan, (n,))
    win = (view[:, 0]*profit_pct).reshape(-1, 1)   # column of win levels
    loss = (view[:, 0]*loss_pct).reshape(-1, 1)    # ... of loss levels
    futw = np.argmax(view >= win, axis=1)  # for each line, index of future win, or 0
    futl = np.argmax(view <= loss, axis=1)
    res = (futw > 0)*1                     # res is 1 where there is a future win
    res[(futl > 0) & ((futw > futl) | (futw == 0))] = -1  # unless a future loss exists sooner
    df['slide'] = res
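For reference, a self-contained sanity check of slide on the toy DataFrame from the question (the function is repeated here so the snippet runs on its own; note the comparisons against the NaN padding evaluate to False, which is exactly what the sentinel logic needs):

```python
import numpy as np
import pandas as pd

def slide(df, profit_pct=1.01, loss_pct=0.99):
    n = len(df)
    valswithnan = np.concatenate([df.close.values, [np.nan]*(n-1)])
    view = np.lib.stride_tricks.sliding_window_view(valswithnan, (n,))
    win = (view[:, 0]*profit_pct).reshape(-1, 1)
    loss = (view[:, 0]*loss_pct).reshape(-1, 1)
    futw = np.argmax(view >= win, axis=1)
    futl = np.argmax(view <= loss, axis=1)
    res = (futw > 0)*1
    res[(futl > 0) & ((futw > futl) | (futw == 0))] = -1
    df['slide'] = res

df = pd.DataFrame({'close': [1, 1.01, 1.015, 1.0, 0.98]})
slide(df)
print(df['slide'].tolist())  # [1, -1, -1, -1, 0]
```

This matches the classification produced by the original loop-based function on the same data.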
Experimental setup
def gen():
    # Something that looks like random variations, with equal opportunities to win/lose...
    return pd.DataFrame({'close': 100 + np.cumsum(np.random.normal(0, 1, (10000,)))})

df = gen()
Verify the column differences between all 4 methods:

- long_opportunities for yours
- opt for my 1st version
- slide for my 2nd version with sliding_window_view
- cmm for mozway's cuminmax (but the check fails for it; a pity, since the timings rock)

def check():
    df = gen()
    classify_long_opportunities(df)
    classOpt(df)
    slide(df)
    cuminmax(df)
    return (((df['long_opportunities'] - df.opt)**2).sum(),
            ((df.opt - df.slide)**2).sum(),
            ((df.opt - df.cmm)**2).sum())

I ran dozens of checks. All 3 methods (yours and my two) always give the exact same result. But timings...
But timings...
Timings
-------
| Method | Timing |
| ------ | ------ |
| Your method | 14.66 s |
| My 1st | 240 ms |
| My 2nd | 152 ms |
| Mozway | 40 ms |
Note that `sliding_window_view` is not that impressive on this problem. I mean, way less than the 3000× gain it gave in my previous usage on other problems. This probably has to do with the lot of useless computation it does (a triangle of half of the view is full of NaNs). Yet, it is still the fastest of the methods that pass the check. Mozway's method is way faster, but its result differs so far.