I have a stock dataset which for the sake of simplicity of this question only has a 'close' column and looks like this:

import pandas as pd

df = pd.DataFrame({'close': [1, 1.01, 1.015, 1.0, 0.98]})

print(df)

   close
0  1.000
1  1.010
2  1.015
3  1.000
4  0.980

Now I want to classify all datapoints wether they would be a profitable long opportunity. A profitable long opportunity means, that the future prices will hit a specific level which will be greater or equal to the specified win level. Contrary to that, it will not be profitable if it hits a loss level before.

I'm currently using this function:

def classify_long_opportunities(
        df: pd.DataFrame,
        profit_pct: float = 0.01,  # won after one percent
        loss_pct: float = 0.01,    # lost after one percent
):
    res = []
    for t in df[['close']].itertuples():
        idx, c = t                                      # extract index and close price
        win  = c   c * profit_pct                       # calculate the desired win level
        loss = c - c * loss_pct                         # "" loss level
        for e in df[['close']].loc[idx:].itertuples():  # iterate through the future prices
            _, p_c = e                                  # extract posterior closing price
            if c < win <= p_c:                          # found a fitting profit level
                res.append(1)
                break
            elif c > loss >= p_c:                       # found a losing level
                res.append(-1)
                break
        else:                                           # didn't found fitting level at all
            res.append(0)

    df['long_opportunities'] = res

The function classifies correctly:

classify_long_opportunities(df=df)

print(df)

   close  long_opportunities
0  1.000                   1
1  1.010                  -1
2  1.015                  -1
3  1.000                  -1
4  0.980                   0

But it's very slow. How can I optimize this function with the use of vectorization like numpy.where or numpy.select or a pandas function?

CodePudding user response：

IIUC, you can use a cumulated min/max and numpy.select:

import numpy as np

profit_pct = 0.01
loss_pct = 0.01

m1 = df['close'].mul(1 profit_pct).ge(df.loc[::-1,'close'].cummax())
m2 = df['close'].mul(1-loss_pct).le(df.loc[::-1,'close'].cummin())

df['long_opportunities'] = np.select([m2, m1], [0, -1], 1)

print(df)

output:

   close  long_opportunities
0  1.000                   1
1  1.010                  -1
2  1.015                  -1
3  1.000                  -1
4  0.980                   0

CodePudding user response：

Optimisation of your code

Keeping the exact same idea, including iteration, your code can be vectorized a bit, loosing at least the inner for loop

def classOpt(df, profit_pct=1.01, loss_pct=0.99):
    vals=df.close.values
    res=[]
    for i in range(len(df)):
        win=vals[i]*profit_pct
        loss=vals[i]*loss_pct
        futw=np.argmax(vals[i:]>=win)
        futl=np.argmax(vals[i:]<=loss)
        if (futw>0) and (futl==0 or futl>futw):
            res.append(1)
        elif (futl>0) and (futw==0 or futw>futl):
            res.append(-1)
        else:
            res.append(0)
    df['opt']=res

The idea is, at each stage, to at least work "vectorized" way on the array of future values. So at stage i, on vals[i:]. We get a bool array saying which future value is a win vals[i:]>=win. And which one is a loss vals[i:]<=loss.

With np.argmax we can easily get when this win or loss will occur, if it occurs. np.argmax(vals[i:]>=win). Note that since we included column i in the future values (as a sentinel in fact), we know that first boolean has to be False. So if that np.argmax(vals[i:]>=win) is 0, that means that there is no future win to come. If it is non 0, it is the number of days when the first future win will occur.

Likewise for future loss

So, result is 1, if a futw is non zero, and futl is either 0 or bigger than futw. That is, if there is a win to come, and either no loss, or a loss further in the future than the win to come (again, I find it a strange rule, but that is the one of your code)

The symmetric situation is a -1. Else 0.

Sliding window method

(Note: it is the fourth time in only a week or two, that I use this function in SO questions. A bit of recycling :-). In selected answer, btw, so, so far, it was really efficient. I fear this time, if mozway manage to correct result differences, that it will not go as well).

This method is based on the np.lib.stride_tricks.sliding_window_view function.

If M is [1,2,3,10,20,30,40], then sliding_window_view(M, (3,)) is

[[1,   2,  3],
 [2,   3, 10],
 [3,  10, 20],
 [10, 20, 30],
 [20, 30, 40]]

I think you see how it can be useful for computing with future values.

And one beauty of it, is that it is just a view. So no memory is really allocated for this (potentially huge otherwise) array.

In your case, because we want all future values, we need len(df) columns. And since we want that even for the last line, we need first to fill the values with some NaN. len(df)-1 NaN precisely, so that the last line can have exactly as much (void) predictions as the first line.

Then we have a len(df)×len(df) view. With the first column being the actual values. And each other columns being future values, at D 1, D 2, ...

From there, we just have to do the exact same thing as before, with argmax(...>win).

Here is the code

def slide(df, profit_pct=1.01, loss_pct=0.99):
    n=len(df)
    valswithnan=np.concatenate([df.close.values, [np.nan]*(n-1)])
    view=np.lib.stride_tricks.sliding_window_view(valswithnan, (n,))
    win=(view[:,0]*profit_pct).reshape(-1,1) # Column of win
    loss=(view[:,0]*loss_pct).reshape(-1,1) # of loss
    futw=np.argmax(view>=win, axis=1) # For each line, index of future win or 0
    futl=np.argmax(view<=loss, axis=1)
    res=(futw>0)*1 # res is 1 where there is a future win
    res[(futl>0) & ((futw>futl) | (futw==0))]=-1 # unless a future loss exists sooner
    df['slide']=res

Experimental setup

def gen(): # Something that looks like random variations. With equal opportunities to win/lose... return pd.DataFrame({'close':100 np.cumsum(np.random.normal(0, 1, (10000,)))})

df=gen()

Verify column differences between all 4 methods

'long_opportunies' for yours

opt for my 1st version

slide for my 2nd version with sliding_window_view

cuminmax for mozway's (but check fails for it. Pity, since timings rock)

def check(): df=gen() classify_long_opportunities(df) classOpt(df) slide(df) cuminmax(df) return ((df['long_opportunities']-df.opt)**2).sum(), ((df.opt-df.slide)**2).sum(), ((df.opt-df.cmm)**2).sum()


Ran dozens of check. All 3 methods (yours, and the 2 mine) give always the exact same result.

But timings...

Timings
-------
| Method | Timing |
| ------ | ------ |
| Your method | 14.66 s |
| My 1st | 240 ms |
| My 2nd | 152 ms |
| Mozway | 40 ms |


Note that `sliding_window_view` is not that impressive on this problem. I mean way less that the 3000× gain it gave in my previous usage on other problems. This has probably to do with lot of useless computation it does (a triangle of half of the view is full of  NaNs). Yet, it is still the fastest. Mozway's method is way faster, but result differs so far.