Python Pandas, add column containing first index where future column value is greater than current row

Is there a way to add a column that indicates the next index that meets some condition (e.g. the first index where a future row's val is greater than the current row's val) using a vectorized approach?

I found a number of examples that show how to do this using a fixed value, such as getting the next index where a column is greater than 0, but I want to do this for every row, based on that row's own value.

Here's an example of doing this with a simple loop. I'm curious whether there's a Pandas/vectorized approach that does the same:

import pandas as pd

df = pd.DataFrame( [0,2,3,2,3,4,5,6,5,4,7,8,7,2,3], columns=['val'], index=pd.date_range('20220101', periods=15))

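def add_new_highs(df):
    # Start with NaT everywhere; rows with no later, larger value keep it.
    df['new_high'] = pd.NaT
    for i, v in df['val'].items():
        # Values from row i onward that exceed the current row's val.
        later = df.loc[i:, 'val']
        higher = later[later > v]
        if len(higher) > 0:
            # Assign with a single .loc call to avoid chained assignment.
            df.loc[i, 'new_high'] = higher.index[0]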

add_new_highs(df)
print(df)

Output:

            val   new_high
2022-01-01    0 2022-01-02
2022-01-02    2 2022-01-03
2022-01-03    3 2022-01-06
2022-01-04    2 2022-01-05
2022-01-05    3 2022-01-06
2022-01-06    4 2022-01-07
2022-01-07    5 2022-01-08
2022-01-08    6 2022-01-11
2022-01-09    5 2022-01-11
2022-01-10    4 2022-01-11
2022-01-11    7 2022-01-12
2022-01-12    8        NaT
2022-01-13    7        NaT
2022-01-14    2 2022-01-15
2022-01-15    3        NaT

CodePudding user response:

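One option is to use numpy broadcasting. Comparing the values as a column vector against the same values as a row vector gives an n x n boolean array whose entry (i, j) says whether row j's val is greater than row i's. Since we only want indices that appear after the current index, we only need to look at the upper triangle of that array, so we use np.triu. Then, since we need the first such index, we use argmax along each row. Finally, for some rows there might never be a greater value; argmax returns 0 for those, which points back at the first index, so we use where to replace any result that is not later than the row's own index with NaT: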

import numpy as np
df['new_high'] = df.index[np.triu(df[['val']].to_numpy() < df['val'].to_numpy()).argmax(axis=1)]
df['new_high'] = df['new_high'].where(lambda x: x.index < x)

Output:

            val   new_high
2022-01-01    0 2022-01-02
2022-01-02    2 2022-01-03
2022-01-03    3 2022-01-06
2022-01-04    2 2022-01-05
2022-01-05    3 2022-01-06
2022-01-06    4 2022-01-07
2022-01-07    5 2022-01-08
2022-01-08    6 2022-01-11
2022-01-09    5 2022-01-11
2022-01-10    4 2022-01-11
2022-01-11    7 2022-01-12
2022-01-12    8        NaT
2022-01-13    7        NaT
2022-01-14    2 2022-01-15
2022-01-15    3        NaT
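
To make the one-liner above easier to follow, here is a rough step-by-step sketch of the same idea on a tiny four-row series. The variable names (greater, future, first) are my own, and instead of the index comparison in where I mask with any(axis=1), which is just another way to tell a real hit from argmax's "nothing found" fallback of 0. Treat it as an illustration, not a replacement for the answer's code.

```python
import numpy as np
import pandas as pd

# Small made-up example so every intermediate array is easy to inspect.
s = pd.Series([1, 3, 2, 4], index=pd.date_range("20220101", periods=4), name="val")
vals = s.to_numpy()

# greater[i, j] is True when val[j] > val[i] (column vector vs. row vector).
greater = vals[:, None] < vals

# Keep only j >= i, i.e. the current row and everything after it.
future = np.triu(greater)

# argmax gives the position of the first True in each row,
# and 0 when the row contains no True at all.
first = future.argmax(axis=1)

# any(axis=1) separates real hits from the "no True" fallback.
new_high = pd.Series(s.index[first], index=s.index).where(future.any(axis=1))
print(new_high)
```

The last row has no later value at all, so it ends up as NaT, the same way 2022-01-15 does in the output above.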

CodePudding user response:

Similar to @enke's response, but building the comparison matrix step by step:

import numpy as np
vals = df[['val']].to_numpy()  # column vector of shape (n, 1); ignores any other columns
arr = (vals - vals.T) > 0  # arr[i, j] is True where row i's val is larger than row j's
arr = np.tril(arr)  # keep only rows at or after row j
ind = np.argmax(arr, axis=0)  # position of the first later, larger value (0 if there is none)

df['new_high'] = df.index[ind]  # map positions back to index labels
df['new_high'] = df['new_high'].replace({df.index[0]: pd.NaT})  # position 0 means no later high, so use NaT
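
One thing to keep in mind: both answers build an n x n array, so memory grows quadratically with the number of rows. If that becomes a problem, a different technique (not vectorized, and not from either answer) is the classic "next greater element" stack pass, which touches each row a constant number of times. The sketch below is only an assumption of how that could look here; next_higher_index is a hypothetical helper name, not a pandas or numpy function.

```python
import numpy as np
import pandas as pd

def next_higher_index(values, index):
    """For each position, return the label of the first later, strictly larger value."""
    values = np.asarray(values)
    result = np.full(len(values), -1)  # -1 marks "no later high found"
    stack = []  # positions still waiting for a larger value
    for pos, v in enumerate(values):
        # Every waiting position with a smaller value has just found its answer.
        while stack and values[stack[-1]] < v:
            result[stack.pop()] = pos
        stack.append(pos)
    # index[-1] is only a placeholder for the -1 entries; where() turns them into NaT.
    labels = pd.Series(index[result], index=index)
    return labels.where(result >= 0)

df = pd.DataFrame([0, 2, 3, 2, 3, 4, 5, 6, 5, 4, 7, 8, 7, 2, 3], columns=['val'],
                  index=pd.date_range('20220101', periods=15))
df['new_high'] = next_higher_index(df['val'], df.index)
print(df)
```

On the example frame from the question this should give the same new_high column as the loop version, while using only O(n) extra memory. The Python-level loop will be slower per row than numpy, so it mostly pays off when the n x n array no longer fits in memory.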