I have two pandas DataFrames, df_x and df_y. df_x has two target columns, 'high target' and 'low target' (plus a 'period_end_idx' timestamp used below). For every row of df_x, I would like to search through df_y and see whether the 'high target' was reached before the 'low target'. Currently I implement this with .apply, but my code is too inefficient, as it scales linearly with the number of rows in df_x. Any suggestions to optimize/vectorize my code?
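For context, the frames look roughly like this (illustrative values only; the real df_y is a much longer minute-level price series):
import pandas as pd

# Illustrative shapes only -- the real data is much larger
idx = pd.date_range("2022-01-03 09:30", periods=5, freq="T")
df_y = pd.DataFrame({"open": [10.0, 10.2, 9.9, 10.4, 9.7]}, index=idx)
df_x = pd.DataFrame({
    "period_end_idx": idx[:2],
    "high target": [10.3, 10.5],
    "low target": [9.8, 9.6],
})
Here is my current implementation: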
def efficient_high_after_low(row, minute_df):
    """True if the high target was hit after the low target, else False.

    Args:
        row: a row of df_x (a pandas Series) with 'high target',
            'low target' and 'period_end_idx'.
        minute_df: the full minute-level price series (df_y['open']).
    """
    # Only look at minutes strictly after this row's period end
    minute_df_after = minute_df.loc[row.period_end_idx + pd.Timedelta(minutes=1):]
    first_highs = minute_df_after.ge(row['high target'])
    first_lows = minute_df_after.le(row['low target'])
    hi_sum, lo_sum = first_highs.sum(), first_lows.sum()
    if len(first_highs) != len(first_lows):
        raise Exception('Unequal length of first_highs and first_lows')
    if len(first_highs) == 0:
        return None
    elif hi_sum == 0 and lo_sum != 0:
        return True
    elif hi_sum != 0 and lo_sum == 0:
        return False
    elif hi_sum == 0 and lo_sum == 0:
        return None
    # idxmax returns the timestamp of the first True in each boolean series
    elif first_highs.idxmax() > first_lows.idxmax():
        return True
    elif first_highs.idxmax() < first_lows.idxmax():
        return False
    else:
        return None
And I do the following to get these boolean values:
df_x.apply(efficient_high_after_low, axis=1, args=(df_y['open'],))
Running the code above on just the first 1000 rows takes 4 seconds.
CodePudding user response:
Is this totally off? I don't have your dataframe to test it, but I think it does what you want.
import numpy as np
import pandas as pd

df_x = pd.DataFrame(np.random.randint(low=-100, high=500, size=500000),
                    columns=["open"], index=[str(i) for i in range(500000)])
df_x["open"] = df_x["open"].cumsum() / 100000
df_x["high target"] = df_x["open"] + 0.1
df_x["low target"] = df_x["open"] - 0.1
# Placeholder that also keeps the .sum() count correct:
# 0 = undecided, 1 = high target hit first, -1 = low target hit first
df_x["low_first"] = 0
for nrow in range(len(df_x.index)):
    # How does open continue nrow steps into the future?
    df_x["nrow"] = df_x["open"].shift(-nrow)
    # No needless searching: stop once every row that still has data is decided
    if df_x["low_first"].abs().sum() >= df_x["nrow"].notna().sum():
        break
    # Did we break our high threshold? (only updates still-undecided rows)
    df_x["low_first"] = df_x["low_first"] + (df_x["high target"] <= df_x["nrow"]) * (1 - df_x["low_first"].abs())
    # Did we break our low threshold? (only updates still-undecided rows)
    df_x["low_first"] = df_x["low_first"] + (df_x["low target"] >= df_x["nrow"]) * (df_x["low_first"].abs() - 1)
# Cleaning up
df_x = df_x.drop(columns=["nrow"])
# And if you want your low/high columns
df_x["first_high"] = df_x["low_first"].replace(-1, 0)
df_x["first_low"] = df_x["low_first"].replace(1, 0) * (-1)
CodePudding user response:
This is what you could do. First of all, put the open column into your main dataframe, let's call it df. (Note: this only works if df_y has the exact same index; if it doesn't, consider other options like pd.concat or pd.merge_asof.)
df = df_x
df["open"] = df_y["open"]
I also took the liberty of renaming your columns.
As long as your timeseries index is ordered, we can reset the index with
df = df.reset_index()
So now df looks something like this (values are made up):
   high_trgt   low_trgt       open
0   8.746911   8.712824   9.243329
1   9.472977  10.190079   9.744083
2   9.445111  10.269676   9.859353
3   9.972061  10.014381   9.132204
4   8.934692   8.914729  11.453276
# You "time" column isn't actually necessary for this solution
We can create a boolean map of where each target has been hit (the comparison directions mirror the question's .ge and .le):
map_high = df.open.values >= df.high_trgt.values
map_low = df.open.values <= df.low_trgt.values
Now the resource intensive bit:
df["high_was_hit_on"] = pd.Series([map_high[i:].argmax() for i in range(len(map_high))])
df["low_was_hit_on"] = pd.Series([map_low[i:].argmax() for i in range(len(map_low))])
Output:
   high_trgt   low_trgt       open  high_was_hit_on  low_was_hit_on
0   8.746911   8.712824   9.243329                0               1
1   9.472977  10.190079   9.744083                0               0
2   9.445111  10.269676   9.859353                0               0
3   9.972061  10.014381   9.132204                1               0
4   8.934692   8.914729  11.453276                0               0
What I did here is iterate over the rows: for each row, argmax finds the position of the first True in the remainder of the map, i.e. the offset at which that target is first hit. One caveat: argmax also returns 0 when the slice contains no True at all, so "hit at offset 0" is ambiguous with "never hit" (the low map of row 4 above, for example).
Now we can easily check which happened first by doing:
# Here you can customize what you need the results to be
# when two hits happen at the same time
df["high_after_low"] = df.high_was_hit_on < df.low_was_hit_on
In terms of speed, this is the test over a df with 1M rows (find_first_hit wraps the steps above into a single function):
%timeit find_first_hit(df)
3.46 s ± 253 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Truth be told, this isn't exactly vectorized, but I can't think of much more you could do to really achieve that here. Hope that my solution is helpful anyway.
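That said, the "first True at or after each row" step can be vectorized with np.searchsorted. A sketch (the helper name is mine; -1 replaces the ambiguous 0 for rows whose suffix contains no hit):
import numpy as np

def first_hit_offset(mask):
    rows = np.arange(len(mask))
    hit_rows = np.flatnonzero(mask)        # absolute positions of all hits
    pos = np.searchsorted(hit_rows, rows)  # index of the first hit row >= each row
    out = np.full(len(mask), -1)
    found = pos < len(hit_rows)
    out[found] = hit_rows[pos[found]] - rows[found]  # convert to offsets
    return out

df["high_was_hit_on"] = first_hit_offset(map_high)
df["low_was_hit_on"] = first_hit_offset(map_low)
Note that with -1 marking "never hit", the high_was_hit_on > low_was_hit_on comparison then needs those rows handled separately (the question's function returns False or None for them).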