Labeling whether the numbers in a dataframe is going up first or down first-CodePudding

Let's label a dataframe with two columns, A,B, and 100M rows. Starting at the index i, we want to know if the data in column B is trending down or trending up comparing to the data at [i, 'A'].

Here is a loop:

import pandas as pd
df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})

for i in range (0,5):
      j = i
      while j in range (i,i 5) and df.at[i,'label'] == 0:  #if classfied, no need to continue
        if df.at[j,'B']-df.at[i,'A']>= 10:  
          df.at[i,'label'] = 1  #Label 1 means trending up
        if df.at[j,'B']-df.at[i,'A']<= -10: 
          df.at[i,'label'] = 2 #Label 2 means trending down
        j=j 1


    [out]
    A B  label
    0 1   1
    1 10  2
    2 -10 2
    3 2   0
    5 3   0 

    ...

The estimated finishing time for this code is 30 days. (A human with a plot and a ruler might finish this task faster.)

What is a fast way to do this? Ideally without a loop.

CodePudding user response：

Looping on Dataframe is slow compared to using Pandas methods.

The task can be accomplished using Pandas vectorized methods:

rolling method which does computations in a rolling window
min & max methods which we compute in the rolling window
where method DataFrame where allows us to set values based upon logic

Code

def set_trend(df, threshold = 10, window_size = 2):
    '''
        Use rolling_window to find max/min values in a window from the current point
    
        rolling window normally looks at backward values
    
        We use technique from https://stackoverflow.com/questions/22820292/how-to-use-pandas-rolling-functions-on-a-forward-looking-basis/22820689#22820689
        to look at forward values
    '''
    # To have a rolling window on lookahead values in column B
    #    We reverse values in column B
    df['B_rev'] = df["B"].values[::-1]
    #    Max & Min in B_rev, then reverse order of these max/min
    # https://stackoverflow.com/questions/50837012/pandas-rolling-min-max
    df['max_'] = df.B_rev.rolling(window_size, min_periods = 0).max().values[::-1]
    df['min_'] = df.B_rev.rolling(window_size, min_periods = 0).min().values[::-1]
    nrows = df.shape[0] - 1     # adjustment for argmax & armin indexes since rows are in reverse order 
                                # i.e. idx = nrows - x.argmax() give index for max in non-reverse row
    df['max_idx'] = df.B_rev.rolling(window_size, min_periods = 0).apply(lambda x: nrows - x.argmax(), raw = True).values[::-1]
    df['min_idx'] = df.B_rev.rolling(window_size, min_periods = 0).apply(lambda x: nrows - x.argmin(), raw = True).values[::-1]
           
    # Use np.select to implement label assignment logic
    conditions = [
        (df['max_'] - df["A"] >= threshold) & (df['max_idx'] <= df['min_idx']), # max above & comes first
        (df['min_'] - df["A"] <= -threshold) & (df['min_idx'] <= df['max_idx']), # min below & comes first
        df['max_'] - df["A"] >= threshold,        # max above threshold but didn't come first
        df['min_'] - df["A"] <= -threshold,       # min below threshold but didn't come first
    ]
    choices = [
        1, # max above & came first
        2, # min above & came first
        1, # max above threshold
        2, # min above threshold
    ]
    df['label'] = np.select(conditions, choices, default = 0)
    
    # Drop scratch computation columns
    df.drop(['B_rev', 'max_', 'min_', 'max_idx', 'min_idx'], axis = 1, inplace = True)

    return df

Tests

Case 1

df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})
display(set_trend(df, 10, 4))

Case 2

df = pd.DataFrame({'A': [0,1,2], 'B': [1, -10, 10]})
display(set_trend(df, 10, 4))

Output

Case 1

    A   B   label
0   0   1   1
1   1   10  2
2   2   -10 2
3   3   2   0
4   5   3   0
5   0   0   0
6   0   0   0
7   0   0   0
8   0   0   0
9   0   0   0

Case 2

    A   B   label
0   0   1   2
1   1   -10 2
2   2   10  0